Getting started with ARUtools

The ARUtools package aims to make processing of large quantities of acoustic recordings easier through automation of metadata processing and sub-sampling of recordings.

Prior to working on your ARU recordings or meta data you must:

This introduction will walk through the first few steps of extracting the metadata, adding site information, and calculating sunrise and sunset information.

Read file metadata

library(ARUtools)

Let’s use some example data to get started.

head(example_files)
#> [1] "a_BARLT10962_P01_1/P01_1_20200502T050000_ARU.wav"
#> [2] "a_BARLT10962_P01_1/P01_1_20200503T052000_ARU.wav"
#> [3] "a_S4A01234_P02_1/P02_1_20200504T052500_ARU.wav"  
#> [4] "a_S4A01234_P02_1/P02_1_20200505T073000_ARU.wav"  
#> [5] "a_BARLT10962_P03_1/P03_1_20200506T100000_ARU.wav"
#> [6] "a_BARLT11111_P04_1/P04_1_20200506T050000_ARU.wav"

This is a list of hypothetical ARU files from different sites, recorded with different ARUs. The data is fairly messily organized: there is no clear structure to the folders and there appear to be unneeded characters in the file names. However, given the standard structure of site names, ARU ID codes, and datetime stamps, we can extract that information from the file paths alone.
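To make the idea of pulling information out of a file path concrete, here is a minimal base R sketch. It is not how ARUtools is implemented; the helper `extract_first()` and the regular expressions are assumptions chosen only to fit the example paths above.

```r
# Illustrative only: base R regular expressions, not ARUtools internals.
paths <- c(
  "a_BARLT10962_P01_1/P01_1_20200502T050000_ARU.wav",
  "a_S4A01234_P02_1/P02_1_20200504T052500_ARU.wav"
)

# Hypothetical helper: return the first match of `pattern` in each string.
extract_first <- function(x, pattern) {
  regmatches(x, regexpr(pattern, x))
}

# Patterns are assumptions that happen to fit these example paths.
aru_id  <- extract_first(paths, "(BARLT|S4A)[0-9]+")
site_id <- extract_first(paths, "P[0-9]{2}_[0-9]")
stamp   <- extract_first(paths, "[0-9]{8}T[0-9]{6}")

# Parse the compact stamp; UTC is used here purely for illustration.
date_time <- as.POSIXct(stamp, format = "%Y%m%dT%H%M%S", tz = "UTC")

parsed <- data.frame(aru_id, site_id, date_time)
print(parsed)
```

In practice clean_metadata() handles this step for you; the sketch only shows why consistent naming makes the extraction possible.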

First things first, we’ll clean up the metadata associated with the files.

m <- clean_metadata(project_files = example_files)
#> Extracting ARU info...
#> Extracting Dates and Times...

Because our example files follow the standard formats for site ID, ARU ID, and date/time, we can extract all the information without having to change any of the default arguments.

m
#> # A tibble: 42 × 8
#>   file_name   type  path  aru_type aru_id site_id date_time           date      
#>   <chr>       <chr> <chr> <chr>    <chr>  <chr>   <dttm>              <date>    
#> 1 P01_1_2020… wav   a_BA… BarLT    BARLT… P01_1   2020-05-02 05:00:00 2020-05-02
#> 2 P01_1_2020… wav   a_BA… BarLT    BARLT… P01_1   2020-05-03 05:20:00 2020-05-03
#> 3 P02_1_2020… wav   a_S4… SongMet… S4A01… P02_1   2020-05-04 05:25:00 2020-05-04
#> 4 P02_1_2020… wav   a_S4… SongMet… S4A01… P02_1   2020-05-05 07:30:00 2020-05-05
#> # ℹ 38 more rows

If you were reading directly from files you would assign a base directory and then have clean_metadata read the files in that folder and sub-folders.

base_directory <- "/path/to/project/files/"
m <- clean_metadata(project_dir = base_directory)
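Before pointing clean_metadata() at a real directory, it can help to check which files a recursive listing finds. The sketch below uses base R's list.files() on a throwaway directory; the folder and file names are made up for illustration, and this is not necessarily how ARUtools scans a project.

```r
# Build a tiny throwaway project folder (hypothetical names).
base_directory <- file.path(tempdir(), "aru_project")
site_folder <- file.path(base_directory, "a_BARLT10962_P01_1")
dir.create(site_folder, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(site_folder, "P01_1_20200502T050000_ARU.wav"))

# List wav files in the folder and all sub-folders, relative to the base.
wav_files <- list.files(base_directory, pattern = "\\.wav$", recursive = TRUE)
print(wav_files)
```

The relative paths returned here have the same shape as the entries in example_files.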

Add coordinates

Next, we want to add our coordinates to this data.

If your data has GPS logs included, they would have been detected in the above step and you could now use g <- clean_gps(m) to create a list of GPS coordinates.

However, many models of ARUs do not have an internal GPS, and those that do may not accurately record the location where the ARU is deployed. Therefore we recommend that you create a site index file to manually record deployment locations, like this one.

example_sites
#>    Sites Date_set_out Date_removed        ARU    lon    lat Plots Subplot
#> 1  P01_1   2020-05-01   2020-05-03 BARLT10962 -85.03 50.010 Plot1       a
#> 2  P02_1   2020-05-03   2020-05-05   S4A01234 -87.45 52.680 Plot1       a
#> 3  P03_1   2020-05-05   2020-05-06 BARLT10962 -90.38 48.990 Plot2       a
#> 4  P04_1   2020-05-05   2020-05-07 BARLT11111 -85.53 45.000 Plot2       a
#> 5  P05_1   2020-05-06   2020-05-07 BARLT10962 -88.45 51.050 Plot3       b
#> 6  P06_1   2020-05-08   2020-05-09 BARLT10962 -90.08 52.000 Plot1       a
#> 7  P07_1   2020-05-08   2020-05-10   S4A01234 -86.03 50.450 Plot1       a
#> 8  P08_1   2020-05-10   2020-05-11 BARLT10962 -84.45 48.999 Plot2       a
#> 9  P09_1   2020-05-10   2020-05-11   S4A02222 -91.38 45.000 Plot2       a
#> 10 P10_1   2020-05-10   2020-05-11   S4A03333 -90.00 50.010 Plot3       b

While you can simply specify a single date, it is recommended that you use both a start date and an end date for the best matching. This is critical if you are moving your ARUs during a season.
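To see why a date range matters when an ARU is moved, consider the two deployments of BARLT10962 at P01_1 and P03_1 from the table above. The following is a minimal base R sketch of range matching; `match_site()` is a hypothetical helper, not an ARUtools function.

```r
# Two deployments of the same ARU, taken from the site index above.
deployments <- data.frame(
  site_id    = c("P01_1", "P03_1"),
  aru_id     = c("BARLT10962", "BARLT10962"),
  date_start = as.Date(c("2020-05-01", "2020-05-05")),
  date_end   = as.Date(c("2020-05-03", "2020-05-06"))
)

# Hypothetical helper: find the site where `aru` was deployed on `day`.
match_site <- function(aru, day, index) {
  hit <- index$aru_id == aru &
    index$date_start <= day &
    index$date_end >= day
  index$site_id[hit]
}

site_may2 <- match_site("BARLT10962", as.Date("2020-05-02"), deployments)
site_may6 <- match_site("BARLT10962", as.Date("2020-05-06"), deployments)
print(c(site_may2, site_may6))
```

The ARU ID alone cannot tell the two sites apart; the start and end dates do.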

Now let’s clean up this list so we can add these sites to our metadata.

sites <- clean_site_index(example_sites)
#> Error in `clean_site_index()`:
#> ! Problems with data `site_index`:
#> • Column 'site_id' does not exist
#> • Column 'date' does not exist
#> • Column 'aru_id' does not exist
#> • Column 'longitude' does not exist
#> • Column 'latitude' does not exist
#> • See ?clean_site_index

Ooops! We can see right away that clean_site_index() expects the data to be in a particular format. Luckily we can let it know if we’ve used a different format.

sites <- clean_site_index(example_sites,
  name_aru_id = "ARU",
  name_site_id = "Sites",
  name_date_time = c("Date_set_out", "Date_removed"),
  name_coords = c("lon", "lat")
)
#> There are overlapping date ranges
#> • Shifting start/end times to 'noon'
#> • Skip this with `resolve_overlaps = FALSE`

Hmm, that’s an interesting message! This means that some of our deployment dates overlap. ARUtools assumes that if you set out an ARU on a specific day, you probably didn’t set it out at midnight (i.e. the very start of that day). Since we assume you are likely using ARUs for recording in the early morning or late at night, we shift the start/end times to noon as an estimate of when the ARU was likely deployed.

If your ARU was deployed at midnight, use resolve_overlaps = FALSE. Or, if you know the exact time your ARU was deployed, use a date/time rather than just a date in your site index.
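The effect of shifting to noon can be sketched in base R. In the site index, BARLT10962 is removed from P03_1 on 2020-05-06 and set out at P05_1 on the same day, so a recording from that morning matches both sites by date alone. This is an illustration of the idea, not the package's code; `to_noon()` is a hypothetical helper and UTC is used only to keep the example simple.

```r
# Deployments of BARLT10962 that share the date 2020-05-06.
index <- data.frame(
  site_id    = c("P03_1", "P05_1"),
  date_start = as.Date(c("2020-05-05", "2020-05-06")),
  date_end   = as.Date(c("2020-05-06", "2020-05-07"))
)

# Hypothetical helper: treat each date as noon on that day.
to_noon <- function(d) as.POSIXct(paste(d, "12:00:00"), tz = "UTC")
index$start <- to_noon(index$date_start)
index$end   <- to_noon(index$date_end)

# A recording made at 10:00 on the shared day.
recording <- as.POSIXct("2020-05-06 10:00:00", tz = "UTC")
rec_day <- as.Date(recording)

# Matching on dates alone is ambiguous; matching on noon-shifted times is not.
by_date <- index$site_id[index$date_start <= rec_day & index$date_end >= rec_day]
by_noon <- index$site_id[index$start <= recording & index$end >= recording]

print(by_date)
print(by_noon)
```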

sites
#> # A tibble: 10 × 8
#>   site_id aru_id   date_time_start     date_time_end       date_start date_end  
#>   <chr>   <chr>    <dttm>              <dttm>              <date>     <date>    
#> 1 P01_1   BARLT10… 2020-05-01 12:00:00 2020-05-03 12:00:00 2020-05-01 2020-05-03
#> 2 P02_1   S4A01234 2020-05-03 12:00:00 2020-05-05 12:00:00 2020-05-03 2020-05-05
#> 3 P03_1   BARLT10… 2020-05-05 12:00:00 2020-05-06 12:00:00 2020-05-05 2020-05-06
#> 4 P04_1   BARLT11… 2020-05-05 12:00:00 2020-05-07 12:00:00 2020-05-05 2020-05-07
#> # ℹ 6 more rows
#> # ℹ 2 more variables: longitude <dbl>, latitude <dbl>

Note that we’ve lost a couple of non-standard columns: Plots and Subplot.

We can retain these by specifying name_extra.

sites <- clean_site_index(example_sites,
  name_aru_id = "ARU",
  name_site_id = "Sites",
  name_date_time = c("Date_set_out", "Date_removed"),
  name_coords = c("lon", "lat"),
  name_extra = c("Plots", "Subplot")
)
#> There are overlapping date ranges
#> • Shifting start/end times to 'noon'
#> • Skip this with `resolve_overlaps = FALSE`
sites
#> # A tibble: 10 × 10
#>   site_id aru_id   date_time_start     date_time_end       date_start date_end  
#>   <chr>   <chr>    <dttm>              <dttm>              <date>     <date>    
#> 1 P01_1   BARLT10… 2020-05-01 12:00:00 2020-05-03 12:00:00 2020-05-01 2020-05-03
#> 2 P02_1   S4A01234 2020-05-03 12:00:00 2020-05-05 12:00:00 2020-05-03 2020-05-05
#> 3 P03_1   BARLT10… 2020-05-05 12:00:00 2020-05-06 12:00:00 2020-05-05 2020-05-06
#> 4 P04_1   BARLT11… 2020-05-05 12:00:00 2020-05-07 12:00:00 2020-05-05 2020-05-07
#> # ℹ 6 more rows
#> # ℹ 4 more variables: longitude <dbl>, latitude <dbl>, Plots <chr>,
#> #   Subplot <chr>

We can even be fancy and rename them for consistency by using named vectors.

sites <- clean_site_index(example_sites,
  name_aru_id = "ARU",
  name_site_id = "Sites",
  name_date_time = c("Date_set_out", "Date_removed"),
  name_coords = c("lon", "lat"),
  name_extra = c("plot" = "Plots", "subplot" = "Subplot")
)
#> There are overlapping date ranges
#> • Shifting start/end times to 'noon'
#> • Skip this with `resolve_overlaps = FALSE`
sites
#> # A tibble: 10 × 10
#>   site_id aru_id   date_time_start     date_time_end       date_start date_end  
#>   <chr>   <chr>    <dttm>              <dttm>              <date>     <date>    
#> 1 P01_1   BARLT10… 2020-05-01 12:00:00 2020-05-03 12:00:00 2020-05-01 2020-05-03
#> 2 P02_1   S4A01234 2020-05-03 12:00:00 2020-05-05 12:00:00 2020-05-03 2020-05-05
#> 3 P03_1   BARLT10… 2020-05-05 12:00:00 2020-05-06 12:00:00 2020-05-05 2020-05-06
#> 4 P04_1   BARLT11… 2020-05-05 12:00:00 2020-05-07 12:00:00 2020-05-05 2020-05-07
#> # ℹ 6 more rows
#> # ℹ 4 more variables: longitude <dbl>, latitude <dbl>, plot <chr>,
#> #   subplot <chr>
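The named vector passed to name_extra follows the common R convention of new_name = "old_name". As a rough base R illustration of that convention (not the package's implementation), the same mapping can select and rename columns of a small data frame; the data frame here is made up.

```r
# new_name = "old_name", as in the name_extra argument above.
extra <- c("plot" = "Plots", "subplot" = "Subplot")

# A one-row stand-in for the site index (hypothetical).
df <- data.frame(Sites = "P01_1", Plots = "Plot1", Subplot = "a")

# Select by the old names (the values), then rename to the new names.
kept <- df[, unname(extra), drop = FALSE]
names(kept) <- names(extra)
print(kept)
```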

Now let’s add this site-related information to our metadata.

m <- add_sites(m, sites)
#> Joining by columns `date_time_start` and `date_time_end`
m
#> # A tibble: 42 × 12
#>   file_name   type  path  aru_type aru_id site_id date_time           date      
#>   <chr>       <chr> <chr> <chr>    <chr>  <chr>   <dttm>              <date>    
#> 1 P01_1_2020… wav   a_BA… BarLT    BARLT… P01_1   2020-05-02 05:00:00 2020-05-02
#> 2 P01_1_2020… wav   a_BA… BarLT    BARLT… P01_1   2020-05-03 05:20:00 2020-05-03
#> 3 P02_1_2020… wav   a_S4… SongMet… S4A01… P02_1   2020-05-04 05:25:00 2020-05-04
#> 4 P02_1_2020… wav   a_S4… SongMet… S4A01… P02_1   2020-05-05 07:30:00 2020-05-05
#> # ℹ 38 more rows
#> # ℹ 4 more variables: longitude <dbl>, latitude <dbl>, plot <chr>,
#> #   subplot <chr>

Calculate times to sunrise and sunset

Great! We have all the site-related information to describe each recording.

Now to prepare for our selection procedure, the last thing we need to do is calculate the time to sunrise or sunset.

Here we need to be clear about which timezone the ARU was using when it recorded times.

There are two options.

The first option is that all ARUs were set up at home base before deployment. In this case it’s possible they were deployed in a location with a different timezone than what they were recording in. This doesn’t matter, as long as you specify the programmed timezone here. In this case, use tz = "America/Toronto", or whichever time zone was used. Note that timezones must be one of OlsonNames().

The second option is that each ARU unit was set up to record in the local timezone where it was placed. If this is the case, specify tz = "local" and the calc_sun() function will use coordinates to determine local timezones.

(See the Dealing with Timezones vignette for more details).
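The reason the programmed timezone matters is that the same clock time refers to different instants in different timezones. A small base R sketch, using "America/Winnipeg" purely as an example of a second timezone:

```r
# Check that a timezone name is valid before using it.
tz_programmed <- "America/Toronto"
tz_is_valid <- tz_programmed %in% OlsonNames()

# The same time stamp interpreted in two different timezones.
stamp <- "2020-05-02 05:00:00"
as_toronto  <- as.POSIXct(stamp, tz = "America/Toronto")
as_winnipeg <- as.POSIXct(stamp, tz = "America/Winnipeg")

# How far apart the two interpretations are, in hours.
offset_hours <- as.numeric(difftime(as_winnipeg, as_toronto, units = "hours"))
print(tz_is_valid)
print(offset_hours)
```

An hour of error in the recording time translates directly into an hour of error in the time to sunrise or sunset.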

In our example, let’s assume that the ARUs were set up in each location they were deployed. So we’ll use tz = "local", the default setting.

m <- calc_sun(m)
dplyr::glimpse(m)
#> Rows: 42
#> Columns: 15
#> $ file_name <chr> "P01_1_20200502T050000_ARU.wav", "P01_1_20200503T052000_ARU.…
#> $ type      <chr> "wav", "wav", "wav", "wav", "wav", "wav", "wav", "wav", "wav…
#> $ path      <chr> "a_BARLT10962_P01_1/P01_1_20200502T050000_ARU.wav", "a_BARLT…
#> $ aru_type  <chr> "BarLT", "BarLT", "SongMeter", "SongMeter", "BarLT", "BarLT"…
#> $ aru_id    <chr> "BARLT10962", "BARLT10962", "S4A01234", "S4A01234", "BARLT10…
#> $ site_id   <chr> "P01_1", "P01_1", "P02_1", "P02_1", "P03_1", "P04_1", "P04_1…
#> $ date_time <dttm> 2020-05-02 05:00:00, 2020-05-03 05:20:00, 2020-05-04 05:25:…
#> $ date      <date> 2020-05-02, 2020-05-03, 2020-05-04, 2020-05-05, 2020-05-06,…
#> $ longitude <dbl> -85.03, -85.03, -87.45, -87.45, -90.38, -85.53, -85.53, -88.…
#> $ latitude  <dbl> 50.010, 50.010, 52.680, 52.680, 48.990, 45.000, 45.000, 51.0…
#> $ plot      <chr> "Plot1", "Plot1", "Plot1", "Plot1", "Plot2", "Plot2", "Plot2…
#> $ subplot   <chr> "a", "a", "a", "a", "a", "a", "a", "b", "a", "a", "a", "a", …
#> $ tz        <chr> "America/Toronto", "America/Toronto", "America/Toronto", "Am…
#> $ t2sr      <dbl> -74.933333, -53.216667, -47.250000, 79.616667, 207.133333, -…
#> $ t2ss      <dbl> 479.9500, 498.4167, 483.4167, 606.6833, -685.8833, 486.1833,…

Tada! Now we have a complete set of cleaned metadata associated with each recording.
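As a taste of what these columns allow, the sketch below filters recordings to a window around sunrise using the first five t2sr values printed above. It assumes t2sr is in minutes with negative values meaning before sunrise, which the values suggest; the window limits are arbitrary.

```r
# First five rows from the output above (site_id and t2sr only).
rec <- data.frame(
  site_id = c("P01_1", "P01_1", "P02_1", "P02_1", "P03_1"),
  t2sr    = c(-74.933333, -53.216667, -47.250000, 79.616667, 207.133333)
)

# Assumption: minutes relative to sunrise, negative = before sunrise.
# Keep recordings from 60 minutes before to 120 minutes after sunrise.
near_sunrise <- rec[rec$t2sr >= -60 & rec$t2sr <= 120, ]
print(near_sunrise)
```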

This is a very simple example and much of the pain in large projects comes from complications, so be sure to check out vignette("customizing") and vignette("spatial") to dig into some of these issues.

Next steps

Now that we have a set of cleaned metadata, the next step is to select recordings. To do this using a random sampling approach, check out the subsampling article, vignette("SubSample").