Using external storrs.

Often it is useful to retrieve data from an external resource (especially websites). The way this works is:

  1. We do a key lookup on the storr; if that succeeds (i.e. it maps to a hash), continue as normal.

  2. If the lookup fails, pass the key (and namespace) to a “hook” function that generates an R object (in any way).

This is in some ways a variant on the memoisation pattern; if the key refers to a set of arguments to a long-running function, we get something like memoisation (see the bottom of this file).
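
The two-step lookup described above can be sketched in base R. This is only an illustration of the pattern, not storr's implementation: the helper get_or_fetch is hypothetical, a plain environment stands in for the storr, and the value is stored directly under the key rather than through a hash.

```r
# Hypothetical helper: look the key up, and fall back to the hook on a miss.
get_or_fetch <- function(cache, key, namespace, hook) {
  id <- paste(namespace, key, sep = "::")
  if (exists(id, envir = cache, inherits = FALSE)) {
    # Step 1: the lookup succeeded, so return the stored value
    return(get(id, envir = cache, inherits = FALSE))
  }
  # Step 2: the lookup failed, so ask the hook for the object and store it
  value <- hook(key, namespace)
  assign(id, value, envir = cache)
  value
}

cache <- new.env(parent = emptyenv())
counter <- new.env()
counter$n <- 0L

# A toy hook that records how often it is called
hook <- function(key, namespace) {
  counter$n <- counter$n + 1L
  toupper(key)
}

get_or_fetch(cache, "a", "objects", hook)
get_or_fetch(cache, "a", "objects", hook)
counter$n
```

The hook runs only on the first call; the second call is served from the cache, so the counter stays at 1.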

As an example, this vignette will download some DESCRIPTION files from github, using the name of the repository as the key.

The first step is writing a hook function; this is a function with arguments (key, namespace) that returns an R object. For packages stored in the root directory of a repository we can build URLs of the form

https://raw.githubusercontent.com/<username>/<repo>/master/DESCRIPTION

So if the key is a username/repo pair and we ignore namespace we can write a function:

fetch_hook_gh_description <- function(key, namespace) {
  if (!isTRUE(unname(capabilities("libcurl")))) {
    stop("This vignette requires libcurl support in R to run")
  }
  fmt <- "https://raw.githubusercontent.com/%s/master/DESCRIPTION"
  path <- tempfile("gh_description_")
  on.exit(file.remove(path))
  code <- download.file(sprintf(fmt, key), path, mode = "wb")
  if (code != 0L) {
    stop("Error downloading file")
  }
  as.list(read.dcf(path)[1, ])
}

This function downloads the requested DESCRIPTION file into a temporary file (which it promises to delete later using on.exit), checks that the download was successful, then reads in the downloaded file and converts it into a list.

The httr and curl packages make it easier to add authorisation (for example, a token), so that the same approach would also work for private repositories.
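
As a rough sketch of what an authorised hook might look like with the curl package: the function name fetch_hook_gh_private and the GITHUB_PAT environment variable are assumptions made for this example, as is the use of an "Authorization: token ..." header for raw file downloads. Check the current GitHub documentation before relying on it.

```r
# Hypothetical variant of the hook for private repositories.
fetch_hook_gh_private <- function(key, namespace) {
  # Assumption: a personal access token is stored in GITHUB_PAT
  token <- Sys.getenv("GITHUB_PAT")
  if (!nzchar(token)) {
    stop("GITHUB_PAT is not set")
  }
  fmt <- "https://raw.githubusercontent.com/%s/master/DESCRIPTION"
  # Attach the token as a request header (assumed header format)
  h <- curl::new_handle()
  curl::handle_setheaders(h, Authorization = paste("token", token))
  path <- tempfile("gh_description_")
  on.exit(unlink(path))
  # curl_download() raises an error on HTTP failures such as 404
  curl::curl_download(sprintf(fmt, key), path, handle = h)
  as.list(read.dcf(path)[1, ])
}
```

Because curl::curl_download() signals an error on failure, no separate check of a return code is needed here.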

With this in place, we can build a storr:

st <- storr::storr_external(storr::driver_environment(),
                            fetch_hook_gh_description)

The first argument here is a storr driver (i.e., the result of a driver_ function). If you already have a storr that you want to use, pass st$driver instead; this extracts the underlying driver, so the external storr shares storage with your existing storr.

As with other storr creation functions, you can set the default namespace using the default_namespace argument.
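
For instance, a small sketch with a toy hook (hook_ns, st_ns and the namespace name "descriptions" are made up for illustration) shows that the default namespace is what the hook receives when get is called without a namespace:

```r
# Toy hook that simply reports which namespace and key it was asked for
hook_ns <- function(key, namespace) {
  paste(namespace, key, sep = ":")
}

st_ns <- storr::storr_external(storr::driver_environment(), hook_ns,
                               default_namespace = "descriptions")

# No namespace given, so the default is passed on to the hook
st_ns$get("a")

# The fetched key is stored in that namespace
st_ns$list("descriptions")
```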

The returned object is exactly the same as a usual storr except that the get method has changed (this is done by inheritance). The get method only behaves differently when the object is not present in the storr, in which case it will try to fetch the object and insert it into the storr.

At first there is nothing in here:

st$list()
## character(0)

But we can still get things from the storr:

d <- st$get("richfitz/storr")

Once a key has been fetched, it will be retrieved locally:

identical(st$get("richfitz/storr"), d)
## [1] TRUE

And it will be present within the storr, as shown by list:

st$list()
## [1] "richfitz/storr"

If an external resource cannot be located, storr will throw an error of class KeyErrorExternal:

tryCatch(st$get("richfitz/no_such_repo"),
         KeyErrorExternal = function(e)
           message(sprintf("** Repository %s not found", e$key)))
## Warning in download.file(sprintf(fmt, key), path, mode = "wb"): cannot open URL
## 'https://raw.githubusercontent.com/richfitz/no_such_repo/master/DESCRIPTION':
## HTTP status was '404 Not Found'
## Warning in file.remove(path): cannot remove file '/tmp/Rtmp9eO9DJ/
## gh_description_1541425c7806', reason 'No such file or directory'
## ** Repository richfitz/no_such_repo not found

This would happen for all errors, including lack of internet connectivity, corrupt file downloads, etc. The original error will be returned as the $e element of the error if you need to distinguish between types of failure. The KeyErrorExternal is also a KeyError class, so code that catches KeyErrors will still work as expected.
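
A minimal sketch of inspecting the original error, using a hook that always fails so that no network access is needed (failing_hook and st_fail are names invented for this example):

```r
# A hook that always fails, standing in for a download error
failing_hook <- function(key, namespace) {
  stop(sprintf("no resource for key '%s'", key))
}

st_fail <- storr::storr_external(storr::driver_environment(), failing_hook)

# The original condition raised inside the hook is available as e$e
msg <- tryCatch(
  st_fail$get("missing"),
  KeyErrorExternal = function(e) conditionMessage(e$e))
msg

# A handler for the more general KeyError class catches the same error
caught <- tryCatch(
  st_fail$get("missing"),
  KeyError = function(e) inherits(e, "KeyErrorExternal"))
caught
```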

For more details on storr exception handling, see the storr vignette (vignette("storr", package = "storr")).

Note that if you want to persist the descriptions to disk, you would need to mangle the key, because the keys used here contain a "/" and the rds driver stores keys as file names:

st_rds <- st$export(storr::storr_rds(tempfile(), mangle_key = TRUE))
st_rds$list()
## [1] "richfitz/storr"
st_rds$get("richfitz/storr")$Version
## [1] "1.2.4"

The st_rds storr does not include the fetch hook; it is a plain storr.

st_rds$destroy()

Memoisation

The external storr can support a form of memoisation, though it might be simpler to implement this directly (see below).

Suppose you have some expensive function f(a, b):

f <- function(a, b) {
  message(sprintf("Computing f(%.3f, %.3f)", a, b))
  ## ...expensive computation here...
  list(a, b)
}

and a set of parameters to run the function over, with each parameter set (row) associated with an id:

pars <- data.frame(id = as.character(1:10), a = runif(10), b = runif(10),
                   stringsAsFactors = FALSE)

The hook here simply looks the parameters up and arranges to run them:

hook <- function(key, namespace) {
  p <- pars[match(key, pars$id), -1]
  f(p$a, p$b)
}

st <- storr::storr_external(storr::driver_environment(), hook)

The first time the result is retrieved, the message will be printed (the function is evaluated):

x <- st$get("1")
## Computing f(0.625, 0.210)

The second time, the message will not be printed, as the result is retrieved from the storr:

identical(st$get("1"), x)
## [1] TRUE

This idea can be generalised by storing the parameters and the functions in the storr so that we lose the dependency on the global variables:

st <- storr::storr_environment()
st$set("experiment1", pars, namespace = "parameters")
st$set("experiment1", f, namespace = "functions")

hook2 <- function(key, namespace) {
  f <- st$get(namespace, namespace = "functions")
  pars <- st$get(namespace, namespace = "parameters")
  p <- pars[match(key, pars$id), -1]
  f(p$a, p$b)
}

st_use <- storr::storr_external(st$driver, hook2)

x1 <- st_use$get("1", "experiment1")
## Computing f(0.625, 0.210)
x2 <- st_use$get("1", "experiment1")

Memoisation in the style of the memoise package is possible to implement, but is not provided in the package. Implementation is straightforward and will work with any driver:

memoise <- function(f, driver = storr::driver_environment()) {
  force(f)
  st <- storr::storr(driver)
  function(...) {
    ## NOTE: also digesting the inputs as a key here (in addition to
    ## storr's usual digesting of values)
    key <- digest::digest(list(...))
    tryCatch(
      st$get(key),
      KeyError = function(e) {
        ans <- f(...)
        st$set(key, ans)
        ans
      })
  }
}
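
One consequence of building the key from digest::digest(list(...)) is that the key depends on exactly how the arguments are passed, not only on their values. The short check below illustrates this; key_for is a hypothetical helper that mirrors the key construction above.

```r
# Hypothetical helper reproducing the key construction used in memoise()
key_for <- function(...) {
  digest::digest(list(...))
}

# The same call always gives the same key
identical(key_for(1), key_for(1))

# Naming the argument changes the list, and therefore the key
identical(key_for(1), key_for(x = 1))

# So does changing the type (double versus integer)
identical(key_for(1), key_for(1L))
```

The first comparison is TRUE and the other two are FALSE, so calls such as g(1) and g(x = 1) would be cached separately by this implementation.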

Here's a function that will print a message when it is evaluated:

f <- function(x) {
  message("computing...")
  x * 2
}

Create the memoised function:

g <- memoise(f)

The first time an argument is seen, f() will be run, printing a message:

g(1)
## computing...
## [1] 2

Subsequent times will be looked up from the storr:

g(1)
## [1] 2

This storr-based version takes about twice as long as memoise, which does a direct key->value mapping rather than going through hashed values because it is the only thing that ever touches its cache. However, the overhead is approximately half the cost of one call to message(), so it is small in practice.