<%@meta language="R-vignette" content="-------------------------------- %\VignetteIndexEntry{A Future for R: Non-Exportable Objects} %\VignetteAuthor{Henrik Bengtsson} %\VignetteKeyword{R} %\VignetteKeyword{package} %\VignetteKeyword{vignette} %\VignetteKeyword{future} %\VignetteKeyword{promise} %\VignetteEngine{R.rsp::rsp} %\VignetteTangle{FALSE} --------------------------------------------------------------------"%> # <%@meta name="title"%> Certain types of objects are tied to a given R session. Such objects cannot be saved to file by one R process and then later be reloaded in another R process and expected to work correctly. If attempted, we will often get an informative error but not always. For the same reason, these type of objects cannot be exported to another R processes(*) for parallel processing regardless of which parallelization framework we use. We refer to these type of objects as "non-exportable objects". (*) The exception _might be_ when _forked_ processes are used, i.e. `plan(multicore)`. However, such attempts to work around the underlying problem, which is non-exportable objects, should be avoided and considered non-stable. Moreover, such code will fail to parallelize when using other future backends. ## A first example - file connections An example of a non-exportable object is a connection, e.g. a file connection. For instance, if you create a file connection, ```r con <- file("output.log", open = "wb") cat("hello ", file = con) flush(con) readLines("output.log", warn = FALSE) ## [1] "hello " ``` it will not work when used in another R process. If we try, the result is "unknown", e.g. ```r library(future) plan(multisession) f <- future({ cat("world!", file = con); flush(con) }) value(f) ## NULL close(con) readLines("output.log", warn = FALSE) ## [1] "hello " ``` In other words, the output `"world!"` written by the R worker is completely lost. The culprit here is that the connection uses a so called _external pointer_: ```r str(con) ## Classes 'file', 'connection' atomic [1:1] 3 ## ..- attr(*, "conn_id")= ``` which is bound to the main R process and makes no sense to the worker. Ideally, the R process of the worker would detect this and produce an informative error message, but as seen here, that does not always occur. ## Protect against non-exportable objects To help avoiding exporting non-exportable objects by mistakes, which typically happens because a global variable is non-exportable, the future framework provides a mechanism for automatically detecting such objects. To enable it, do: ```r options(future.globals.onReference = "error") f <- future({ cat("world!", file = con); flush(con) }) ## Error: Detected a non-exportable reference ('externalptr') in one of the globals ## ('con' of class 'file') used in the future expression ``` _Comment_: The `future.globals.onReference` options is set to `"ignore"` by default due to the extra overhead `"error"` introduces, which can be significant for very large nested objects. Furthermore, some subclasses of external pointers can be exported without causing problems. ## Packages with non-exportable objects The below table and sections provide a few examples of non-exportable R objects that you may run into when trying to parallelize your code, or simply from trying reload objects saved in a previous R session. _If you identify other cases, please consider [reporting](https://github.com/futureverse/future/issues/) them so they can be documented here and possibly even be fixed._ Package | Examples of non-exportable types or classes :---------------|:------------------------------------------- **arrow** | Table (`externalptr`) **base** | connection (`externalptr`) **bigmemory** | big.matrix (`externalptr`) **cpp11** | E.g. functions created by `cpp_source()` **DBI** | DBIConnection (`externalptr`) **inline** | CFunc (`externalptr` of class DLLHandle) **keras** | keras.engine.sequential.Sequential (`externalptr`), keras.engine.base_layer.Layer (`externalptr`) **magick** | magick-image (`externalptr`) **ncdf4** | ncdf4 (custom reference; _non-detectable_) **parallel** | cluster and cluster nodes (`connection`) **polars** | RPolarsDataFrame (`externalptr`) **raster** | RasterLayer (`externalptr`; _not all_) **Rcpp** | NativeSymbol (`externalptr`) **reticulate** | python.builtin.function (`externalptr`), python.builtin.module (`externalptr`) **rJava** | jclassName (`externalptr`) **ShortRead** | FastqFile, FastqStreamer, FastqStreamerList (`connection`) **sparklyr** | tbl_spark (`externalptr`) **terra** | SpatRaster, SpatVector (`externalptr`) **udpipe** | udpipe_model (`externalptr`) **xgboost** | xgb.DMatrix (`externalptr`) **XML** | XMLInternalDocument, XMLInternalElementNode (`externalptr`) **xml2** | xml_document (`externalptr`) These are illustrated in sections 'Packages that rely on external pointers' and 'Packages with other types of non-external objects' below. Importantly, there are objects with external pointer than can indeed be exported. Here are some example, Package | Examples of exportable types or classes :---------------|:------------------------------------------- **data.table** | data.table (`externalptr`) **rstan** | stanmodel (`externalptr`) These are discussed in sections 'False positives - packages with exportable external pointers' at the very end of this vignette. ### Packages with connections #### Package: parallel ```r library(future) plan(multisession, workers = 2) cl <- parallel::makeCluster(2L) y <- parSapply(cl, X = 2:3, FUN = sqrt) y ## [1] 1.414214 1.732051 y %<-% parSapply(cl, X = 2:3, FUN = sqrt) y ## Error in summary.connection(connection) : invalid connection ``` If we turn on `options(future.globals.onReference = "error")`, we will catch this already when we create the future: ```r y %<-% parSapply(cl, X = 2:3, FUN = sqrt) ## Error: Detected a non-exportable reference ('externalptr') in one of the globals ## ('cl' of class 'SOCKcluster') used in the future expression ``` ### Packages that rely on external pointers If an object carries an _external pointer_, it is likely that it can only be used in the R session where it was created. If it is exported to and used in a parallel process, it will likely cause an error there. As shown above, and in below examples, setting option `future.globals.onReference` to `"error"` will make **future** to scan for _external pointer_:s before launching the future on a parallel worker, and throw an error if one is detected. However, there are objects with _external pointer_:s that can be exported, e.g. `data.table` objects of the **[data.table]** package is one such example. In other words, the existence of a _external pointer_ is just a suggestion for an object being non-exportable - it is not a sufficient condition. Below are some examples of packages who produce non-exportable objects with _external pointer_:s. #### Package: arrow The **[arrow]** package provides efficient in-memory storage of arrays and tables. However, these objects cannot be transferred as-is to a parallel worker. ```r library(arrow) library(future) plan(multisession) data <- as_arrow_table(iris) f <- future(dim(data)) v <- value(f) #> Error: Invalid , external pointer to null ``` This error takes place on the parallel worker. We could set `options(future.globals.onReference = "error")` to have **future** detect the problem before it sends the object over to the parallel worker. That said, the **arrow** package provides low-level functions `write_to_raw()` and `read_ipc_stream()` that can used to marshal and unmarshal **arrow** objects. For example, ```r library(arrow) library(future) plan(multisession) data <- as_arrow_table(iris) .data <- write_to_raw(data) ## marshal f <- future({ data <- read_ipc_stream(.data) ## unmarshal dim(data) }) v <- value(f) print(v) #> [1] 150 5 ``` #### Package: bigmemory The **[bigmemory]** package provides mechanisms for working with very large matrices that can be updated in-place, which helps save memory. For example, ```r library(bigmemory) g <- function(x) { x[1,1] <- 42L x } x <- big.matrix(nrow = 3, ncol = 2, type = "integer") print(x[1,1]) #> [1] NA void <- g(x) print(x[1,1]) #> [1] 42 ``` Note how `x` was updated in-place. This is achieved by `big.matrix` objects holds an external pointer to where the matrix data is stored; ```r str(x) #> Formal class 'big.matrix' [package "bigmemory"] with 1 slot #> ..@ address: ``` If we would try to use `x` in a parallel worker, then the parallel worker crashes due to a bug in **bigmemory**, e.g. ```r library(bigmemory) library(future) plan(multisession, workers = 2) x <- big.matrix(nrow = 3, ncol = 2, type = "integer") f <- future(dim(x), packages = "bigmemory") value(f) #> Error in unserialize(node$con) : #> MultisessionFuture () failed to receive message results from #> cluster RichSOCKnode #1 (PID 1746676 on localhost 'localhost'). The #> reason reported was 'error reading from connection'. Post-mortem #> diagnostic: No process exists with this PID, i.e. the localhost worker #> is no longer alive. Detected a non-exportable reference #> ('externalptr') in one of the globals ('x' of class 'big.matrix') used #> in the future expression. The total size of the 1 globals exported is #> 696 bytes. There is one global: 'x' (696 bytes of class 'S4') ``` We can protected against this setting: ```r options(future.globals.onReference = "error") ``` which gives: ```r f <- future(dim(x), packages = "bigmemory") #> Error: Detected a non-exportable reference ('externalptr') in one #> of the globals ('x' of class 'big.matrix') used in the future #> expression ``` #### Package: cpp11 Another example is **[cpp11]**, which allows us to easily create R functions that are implemented in C++, e.g. ```r cpp11::cpp_source(code = ' #include "cpp11/doubles.hpp" using namespace cpp11; [[cpp11::register]] int my_length(doubles x) { return x.size(); } ') ``` so that: ``` x <- rnorm(10) my_length(x) ## [1] 10 ``` However, this function cannot be exported to another R process: ```r library(future) plan(multisession) x <- rnorm(10) n %<-% my_length(x) n #> Error in .Call("_code_1748ff617940b9_my_length", x, PACKAGE = "code_1748ff617940b9") : #> "_code_1748ff617940b9_my_length" not available for .Call() for package "code_1748ff617940b9" ``` #### Package: DBI **[DBI]** provides a unified database interface for communication between R and various database engines. Analogously to regular connections in R, DBIConnection objects can _not_ safely be exported to another R process, e.g. ```r library(future) options(future.globals.onReference = "error") plan(multisession) library(DBI) con <- dbConnect(RSQLite::SQLite(), ":memory:") dummy %<-% print(con) ## Error: Detected a non-exportable reference ('externalptr') in one of the globals ## ('con' of class 'SQLiteConnection') used in the future expression ``` #### Package: inline Another example is **[inline]**, which allows us to easily create R functions that are implemented in C and C++, e.g. ```r library(inline) code <- " int i; for (i = 0; i < *n; i++) x[0] = x[0] + (i+1); " sum_1_to_n <- cfunction(signature(n="integer", x="numeric"), code, language = "C", convention = ".C") y <- sum_1_to_n(10, 0)$x print(y) ## 55 ``` However, if we would attempt to call `sum_1_to_n()` in a future, we get an error: ```r library(future) plan(cluster, workers = 1L) f <- future(sum_1_to_n(10, 0)) v <- value(f) ## Error in .Primitive(".C")(, n = as.integer(n), x = as.double(x)) : ## NULL value passed as symbol address ``` This is because: ```r options(future.globals.onReference = "error") f <- future(sum_1_to_n(10, 0)) ## Error: Detected a non-exportable reference ('externalptr' of class ## 'DLLHandle') in one of the globals ('sum_1_to_n' of class 'CFunc') ## used in the future expression ``` #### Package: keras The **[keras]** package provides an R interface to [Keras](https://keras.io/), which "is a high-level neural networks API developed with a focus on enabling fast experimentation". The R implementation accesses the Keras Python API via **[reticulate]**. However, Keras model instances in R make use of R connections and external pointers, which prevents them from being exported to external R processes. For example, if the attempt to use a Keras model in a multisession workers, the worker will produce a run-time error: ```r library(keras) library(future) plan(multisession) ## Adopted from the 'keras' vignettes inputs <- layer_input(shape = shape(32)) outputs <- layer_dense(inputs, units = 1L) model <- keras_model(inputs, outputs) model <- compile(model, optimizer = "adam", loss = "mean_squared_error") test_input <- array(runif(128 * 32), dim = c(128, 32)) test_target <- array(runif(128), dim = c(128, 1)) fit(model, test_input, test_target) f <- future({ stats::predict(model, test_input) }, seed = TRUE) pred <- value(f) ## Error in do.call(object$predict, args) : ## 'what' must be a function or character string ``` This is error message is not very helpful. But, if we turn on `options(future.globals.onReference = "error")`, we get more clues; ```r Error: Detected a non-exportable reference ('externalptr') in one of the globals ('model' of class 'keras.engine.functional.Functional') used in the future expression ``` Functions `serialize_model()` and `unserialize_model()` of the **keras** package can be used as workaround to marshal and unmarshal non-exportable **keras** objects, e.g. ```r .model <- serialize_model(model) ## marshal f <- future({ model <- unserialize_model(.model) ## unmarshal stats::predict(model, test_input) }, seed = TRUE) rm(.model) ## not needed anymore pred <- value(f) str(pred) ## num [1:128, 1] 0.6937 -0.048 0.2996 -0.0818 1.0673 ... ``` #### Package: magick The **[magick]** package provides an R-level API for [ImageMagick](https://imagemagick.org/) to work with images. When working with this API, the images are represented internally as external pointers of class 'magick_image' that cannot be be exported to another R process, e.g. ```r library(future) plan(multisession) library(magick) frink <- magick::image_read("https://jeroen.github.io/images/frink.png") f <- future(image_fill(frink, "orange", "+100+200", 20)) v <- value(f) ## Error: Image pointer is dead. You cannot save or cache image objects ## between R sessions. ``` If we set: ```r options(future.globals.onReference = "error") ``` we'll see that this is caught even before attempting to run this in parallel; ```r > f <- future(image_fill(frink, "orange", "+100+200", 20)) ## Error: Detected a non-exportable reference ('externalptr' of class ## 'magick-image') in one of the globals ('frink' of class 'magick-image') ## used in the future expression ``` #### Package: polars The **[polars]** package provides objects for performant processing on tabular data. However, these objects are tied to the R process that created them. If we attempt to use them in a parallel worker, we end up crashing the parallel worker: ```r library(future) plan(multisession) library(polars) data <- as_polars_df(data.frame(x = 1:3)) f <- future(dim(data), packages = "polars") v <- value(f) #> Error: Execution halted with the following contexts #> 0: In R: in `$.RPolarsDataFrame` #> 0: During function call [workRSOCK()] #> 1: This Polars object is not valid. Execute `rm()` to remove #> the object or restart the R session. ``` This is because the external pointer in the `RPolarsDataFrame` object is erased when transferred to another process, which **polars** (>= 0.15.0) detects and gives an informative error message about. #### Package: raster The **[raster]** package provides methods for working with spatial data, which are held in 'RasterLayer' objects. Not all but some of these objects use an external pointer. For example, ```r library(future) plan(multisession) options(future.globals.onReference = "error") library(raster) r <- raster(system.file("external/test.grd", package = "raster")) tf <- tempfile(fileext = ".grd") s <- writeStart(r, filename = tf, overwrite = TRUE) f <- future({ print(dim(r)) print(dim(s)) }) Error: Detected a non-exportable reference ('externalptr') in one of the globals ('s' of class 'RasterLayer') used in the future expression ``` Note that it is only the RasterLayer object `s` that carries an external pointer. If we dig deeper, we find that this is because `attr(s@file, "con")` is file connection opened for writing. This is why `s` cannot be passed on to an external worker. In contrast, RasterLayer object `r` does not have this problem and would be fine to pass on to a worker. #### Package: Rcpp Similarly to **cpp11**, **[Rcpp]** can be use to create R functions that are implemented in C++, e.g. ```r Rcpp::sourceCpp(code = ' #include using namespace Rcpp; // [[Rcpp::export]] int my_length(NumericVector x) { return x.size(); } ') ``` so that: ``` x <- 1:10 my_length(x) ## [1] 10 ``` However, since this function uses an external pointer internally, we cannot pass it to another R process: ```r library(future) plan(multisession) x <- rnorm(10) n %<-% my_length(x) n ## Error in .Call(, x) : NULL value passed as symbol address ``` We can detect and protect against this using: ```r options(future.globals.onReference = "error") n %<-% my_length(x) ## Error: Detected a non-exportable reference ('externalptr' of class ## 'NativeSymbol') in one of the globals ('my_length' of class 'function') ## used in the future expression ``` #### Package: reticulate The **[reticulate]** package provides methods for creating and calling Python code from within R. If one attempt to use Python-binding objects from this package, we get errors like: ```r library(future) plan(multisession) library(reticulate) os <- import("os") pwd %<-% os$getcwd() pwd ## Error in eval(quote(os$getcwd()), new.env()) : ## attempt to apply non-function ``` and by telling the **future** package to validate globals further, we get: ```r options(future.globals.onReference = "error") pwd %<-% os$getcwd() ## Error: Detected a non-exportable reference ('externalptr') in one of the ## globals ('os' of class 'python.builtin.module') used in the future expression ``` Another reticulate example is when we try to use a Python function that we create ourselves as in: ```r cat("def twice(x):\n return 2*x\n", file = "twice.py") source_python("twice.py") twice(1.2) ## [1] 2.4 y %<-% twice(1.2) y ## Error in unserialize(node$con) : ## Failed to retrieve the value of MultisessionFuture from cluster node #1 ## (on 'localhost'). The reason reported was 'error reading from connection' ``` which, again, is because: ```r options(future.globals.onReference = "error") y %<-% twice(1.2) ## Error: Detected a non-exportable reference ('externalptr') in one of the globals ## ('twice' of class 'python.builtin.function') used in the future expression ``` #### Package: rJava Here is an example that shows how **[rJava]** objects cannot be exported to external R processes. ```r library(future) plan(multisession) library(rJava) .jinit() ## Initialize Java VM on master Double <- J("java.lang.Double") d0 <- new(Double, "3.14") d0 ## [1] "Java-Object{3.14}" f <- future({ .jinit() ## Initialize Java VM on worker new(Double, "3.14") }) d1 <- value(f) d1 ## [1] "Java-Object" ``` Although no error is produced, we see that the value `d1` is a Java NULL Object. As before, we can catch this by using: ```r options(future.globals.onReference = "error") f <- future({ .jinit() ## Initialize Java VM on worker new(Double, "3.14") }) ## Error: Detected a non-exportable reference ('externalptr') in one of the ## globals ('Double' of class 'jclassName') used in the future expression ``` #### Package: ShortRead The **[ShortRead]** package from Bioconductor implements efficient methods for sampling, iterating, and reading FASTQ files. Some of the helper objects used cannot be saved to file or exported to a parallel worker, because they comprise of connections and other non-exportable objects. Here is an example that illustrates how an attempt to use a 'FastqStreamer' object created in the main R session fails when used in a parallel worker: ```r library(future) plan(multisession) # Adopted from example("FastqStreamer", package = "ShortRead") library(ShortRead) sp <- SolexaPath(system.file("extdata", package = "ShortRead")) fl <- file.path(analysisPath(sp), "s_1_sequence.txt") fs <- FastqStreamer(fl, 50) reads %<-% yield(fs) reads ## Error in status(update = TRUE) : invalid FastqStreamer ``` To catch this earlier, and to get a more informative error message, we do as before; ```r options(future.globals.onReference = "error") reads %<-% yield(fs) ## Error: Detected a non-exportable reference ('externalptr') in one of the ## globals ('fs' of class 'FastqStreamer') used in the future expression ``` #### Package: sparklyr ```r library(future) plan(multisession) library(sparklyr) sc <- spark_connect(master = "local") file <- system.file("misc", "exDIF.csv", package = "utils") data <- spark_read_csv(sc, "exDIF", file) d %<-% dim(data) d ## Error in unserialize(node$con) : ## Failed to retrieve the value of MultisessionFuture () from cluster ## SOCKnode #1 (PID 29864 on localhost 'localhost'). The reason reported was ## 'unknown input format'. Post-mortem diagnostic: A process with this PID ## exists, which suggests that the localhost worker is still alive. ``` To catch this as soon as possible, ```r options(future.globals.onReference = "error") d %<-% dim(data) ## Error: Detected a non-exportable reference ('externalptr') in one of ## the globals ('data' of class 'tbl_spark') used in the future expression ``` #### Package: terra ```r library(future) plan(multisession) library(terra) file <- system.file("ex/lux.shp", package = "terra") v <- vect(file) dv %<-% dim(v) dv Error in x@ptr$nrow() : external pointer is not valid file <- system.file("ex/elev.tif", package = "terra") r <- rast(file) dr %<-% dim(r) dr ## Error in .External(list(name = "CppMethod__invoke_notvoid", address = , : ## NULL value passed as symbol address ``` To catch this as soon as possible, ```r options(future.globals.onReference = "error") dv %<-% dim(v) ## Error: Detected a non-exportable reference ('externalptr' of class ## 'RegisteredNativeSymbol') in one of the globals ('v' of class ## 'SpatVector') used in the future expression dr %<-% dim(data) ## Error: Detected a non-exportable reference ('externalptr' of class ## 'RegisteredNativeSymbol') in one of the globals ('r' of class ## 'SpatRaster') used in the future expression ``` Functions `wrap()` and `unwrap()` of the **terra** package can be used as workaround to marshal and unmarshal non-exportable **terra** objects, e.g. ```r library(future) plan(multisession) library(terra) file <- system.file("ex/lux.shp", package = "terra") v <- vect(file) .v <- wrap(v) ## marshal dv %<-% { v <- unwrap(.v) ## unmarshal dim(v) } rm(.v) ## not needed anymore dv [1] 12 6 ``` and ```r file <- system.file("ex/elev.tif", package = "terra") r <- rast(file) .r <- wrap(r) ## marshal dr %<-% { r <- unwrap(.r) ## unmarshal dim(r) } rm(.r) ## not needed anymore dr [1] 90 95 1 ``` For more details, see `help("wrap", package = "terra")`. #### Package: udpipe ```r library(future) plan(multisession) library(udpipe) udmodel <- udpipe_download_model(language = "dutch") udmodel <- udpipe_load_model(file = udmodel$file_model) x %<-% udpipe_annotate(udmodel, x = "Ik ging op reis en ik nam mee.") x ## Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger, : ## external pointer is not valid ``` To catch this as soon as possible, ```r options(future.globals.onReference = "error") x %<-% udpipe_annotate(udmodel, x = "Ik ging op reis en ik nam mee.") ## Error: Detected a non-exportable reference ('externalptr') in one of the ## globals ('udmodel' of class 'udpipe_model') used in the future expression ``` Now, it is indeed possible to parallelize **[udpipe]** calls. For details on how to do this, see the 'UDPipe Natural Language Processing - Parallel' vignette that comes with the **udpipe** package. #### Package: xgboost The **[xgboost]** package provides fast gradient-boosting methods. Some of its data structures use external pointers. For example, ```r library(future) plan(multisession) library(xgboost) data(agaricus.train, package = "xgboost") train <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label) class(train) ## [1] "xgb.DMatrix" d <- dim(train) d ## [1] 6513 126 ``` works just fine but if we attempt to pass on the 'xgb.DMatrix' object `train` to an external worker, we silently get a incorrect value: ```r f <- future(dim(train)) d <- value(f) d ## NULL ``` This is unfortunate, but we can at least detect this by: ```r options(future.globals.onReference = "error") f <- future(dim(dtrain)) ## Error: Detected a non-exportable reference ('externalptr' of class 'xgb.DMatrix') ## in one of the globals ('dtrain' of class 'xgb.DMatrix') used in the future expression ``` This is because `train` itself is an external pointer, i.e. `mode(train) == "externalptr"`. #### Package: XML Another example is XML objects of the **[XML]** package, which may produce evaluation error, or even cause R to abort if used in another R process, e.g. ```r library(future) plan(multisession) library(XML) doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML")) a <- getNodeSet(doc, "/doc//a[@status]")[[1]] f <- future(xmlGetAttr(a, "status")) value(f) ## Error in unserialize(node$con) : ## MultisessionFuture () failed to receive results from cluster ## RichSOCKnode #1 (PID 31541 on localhost 'localhost'). The reason ## reported was 'error reading from connection'. Post-mortem diagnostic: ## No process exists with this PID, i.e. the localhost worker is no ## longer alive. Detected a non-exportable reference ('externalptr' of ## class 'XMLInternalElementNode') in one of the globals ('a' of class ## 'XMLInternalElementNode') used in the future expression. The total ## size of the 1 globals exported is 520 bytes. There is one global: 'a' ## (520 bytes of class 'externalptr') ``` This is an example, where we end up exporting an `XMLInternalElementNode` object to another R process, where it is no longer valid. When we try to use it there by calling `xmlGetAttr()` on it, **XML** causes R to crash and abort. To illustrate what's going on the parallel workers, if we save the object to file using `saveRDS(a, "a.rds")`, and try to use it in _another_ R session, the following happens: ```r $ R --quiet --vanilla > a <- readRDS("a.rds") > XML::xmlGetAttr(a, "status") *** caught segfault *** address 0x40, cause 'memory not mapped' Traceback: 1: xmlAttrs.XMLInternalNode(node, addNamespace) 2: xmlAttrs(node, addNamespace) 3: XML::xmlGetAttr(a, "status") Possible actions: 1: abort (with core dump, if enabled) 2: normal R exit 3: exit R without saving workspace 4: exit R saving workspace Selection: ``` This is a very harsh way of telling us that we cannot export all types of objects produced by **XML**. Ideally, **XML** would detect this and give am informative error message and not crash R like this. A workaround for working around this is to marshal the problematic objects before exporting them to a parallel R process, and unmarshal them before working with them there. For example, ```r library(future) plan(multisession) library(XML) doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML")) a <- getNodeSet(doc, "/doc//a[@status]")[[1]] ## Marshall the non-exportable XMLInternalElementNode object .a <- xmlSerializeHook(a) ## marshal f <- future({ a <- xmlDeserializeHook(.a) ## unmarshal xmlGetAttr(a, "status") }) value(f) ## [1] "xyz" ``` An alternative, more generic workaround, is to always create the `doc` element, an `XMLInternalDocument` object, on the parallel workers, i.e. ```r library(future) plan(multisession) library(XML) f <- future({ doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML")) a <- getNodeSet(doc, "/doc//a[@status]")[[1]] xmlGetAttr(a, "status") }) value(f) ## [1] "xyz" ``` #### Package: xml2 Yet another example is XML objects of the **[xml2]** package, which may produce evaluation errors (or just invalid results depending on how they are used), e.g. ```r library(future) plan(multisession) library(xml2) doc <- read_xml("") f <- future(xml_children(doc)) value(f) ## Error: external pointer is not valid ``` The future framework can help detect this _before_ sending off the future to the worker; ```r options(future.globals.onReference = "error") f <- future(xml_children(xml)) ## Error: Detected a non-exportable reference ('externalptr') in one of the ## globals ('xml' of class 'xml_document') used in the future expression ``` One workaround when dealing with non-exportable objects is to look for ways to encode the object such that it can be exported, and the decoded on the receiving end. With **xml2**, we can use `xml2::xml_serialize()` and `xml2::xml_unserialize()` to do this. Here is how we can rewrite the above example such that we can pass **xml2** object back and forth between the main R session and R workers: ```r ## Encode the 'xml_document' object 'doc' as a 'raw' object .doc <- xml_serialize(doc, connection = NULL) ## marshal f <- future({ ## In the future, reconstruct the 'xml_document' object ## from the 'raw' object doc <- xml_unserialize(.doc) ## unmarshal ## Continue as usual children <- xml_children(doc) ## Send back a 'raw' representation of the 'xml_nodeset' ## object 'children' xml_serialize(children, connection = NULL) }) ## Reconstruct the 'xml_nodeset' object in the main R session children <- xml_unserialize(value(f)) ``` ### Packages with other types of non-external objects #### Package: ncdf4 Package **[ncdf4]** provides an R API to work with data that live in netCDF files. For example, we can create a simple netCDF file that holds a variable 'x': ```r library(ncdf4) x <- ncvar_def("x", units = "count", dim = list()) file <- nc_create("example.nc", x) ncvar_put(file, x, 42) nc_close(file) ``` We can now use this netCDF file next time we start R, e.g. ```r library(ncdf4) file <- nc_open("example.nc") y <- ncvar_get(file) y ## [1] 42 ``` However, it would fail if we attempt to use `file`, which is an object of class 'ncdf4', in a parallel worker, we will get an error: ```r library(future) plan(multisession) library(ncdf4) file <- nc_open("example.nc") f <- future(ncvar_get(file)) y <- value(f) ## Error in R_nc4_inq_varndims: NetCDF: Not a valid ID ## Error in ncvar_ndims(ncid, varid) : error returned from C call ``` This is because ncdf4 objects make use of internal references that are unique to the R session where they were created. However, these are _not_ formal _external pointer_:s, meaning the future framework cannot detect them. That is, using `options(future.globals.onReference = "error")` is of no help here. A workaround is to open the netCDF in each worker, e.g. ```r library(future) plan(multisession) library(ncdf4) f <- future({ file <- nc_open("example.nc") value <- ncvar_get(file) nc_close(file) value }) y <- value(f) y ## [1] 42 ``` ### False positives - packages with exportable external pointers #### Package data.table The **[data.table]** package creates objects comprising external pointers. Contrary to above non-exportable examples, such objects can be saved to file and used in another R session, or exported to a parallel worker. This is because **data.table** is capable of restoring these objects to a valid state. Consider the following example: ```r library(data.table) DT <- data.table(a = 1:3, b = letters[1:3]) ## Extract second row row <- DT[2] print(row) #> a b #> 1: 2 b ``` If we would try the last step with a future with strict checking for references enabled, we would get an error: ```r library(future) plan(multisession) options(future.globals.onReference = "error") row %<-% DT[2] Error: Detected a non-exportable reference ('externalptr') in one of the globals ('DT' of class 'data.table') used in the future expression ``` This is a false positive. If we relax the checks, it does indeed work: ```r options(future.globals.onReference = NULL) row <- DT[2] print(row) #> a b #> 1: 2 b ``` #### Package rstan The **[rstan]** package creates objects comprising external pointers. Contrary to above non-exportable examples, such objects can be saved to file and used in another R session, or exported to a parallel worker. This is because **rstan** is capable of restoring these objects to a valid state. Consider the following example from `example("rstan", package = "rstan")`: ```r library(rstan) code <- " data { int N; real y[N]; } parameters { real mu; } model { target += normal_lpdf(mu | 0, 10); target += normal_lpdf(y | mu, 1); } " y <- rnorm(20) data <- list(N = 20, y = y) fit <- stan(model_code = code, model_name = "example", data = data, iter = 2012L, chains = 3L, sample_file = file.path(tempdir(), "norm.csv")) e <- extract(fit, permuted = FALSE) ``` If we would try the last step with a future with strict checking for references enabled, we would get an error: ```r library(future) plan(multisession) options(future.globals.onReference = "error") e %<-% extract(fit, permuted = FALSE) Error: Detected a non-exportable reference ('externalptr' of class 'DLLHandle') in one of the globals ('fit' of class 'stanfit') used in the future expression ``` However, this is a false positive. The `fit` object, which is of class 'stanfit', can indeed be exported to be used in an external R process, e.g. ```r options(future.globals.onReference = NULL) e %<-% extract(fit, permuted = FALSE) str(e) #> num [1:1006, 1:3, 1:2] -0.3028 -0.4017 -0.3379 -0.2358 0.0443 ... #> - attr(*, "dimnames")=List of 3 #> ..$ iterations: NULL #> ..$ chains : chr [1:3] "chain:1" "chain:2" "chain:3" #> ..$ parameters: chr [1:2] "mu" "lp__" ``` [data.table]: https://cran.r-project.org/package=data.table [arrow]: https://cran.r-project.org/package=arrow [bigmemory]: https://cran.r-project.org/package=bigmemory [cpp11]: https://cran.r-project.org/package=cpp11 [DBI]: https://cran.r-project.org/package=DBI [inline]: https://cran.r-project.org/package=inline [keras]: https://cran.r-project.org/package=DBI [magick]: https://cran.r-project.org/package=magick [ncdf4]: https://cran.r-project.org/package=ncdf4 [polars]: https://rpolars.github.io/ [raster]: https://cran.r-project.org/package=raster [Rcpp]: https://cran.r-project.org/package=Rcpp [reticulate]: https://cran.r-project.org/package=reticulate [rJava]: https://cran.r-project.org/package=rJava [rstan]: https://cran.r-project.org/package=rstan [udpipe]: https://cran.r-project.org/package=udpipe [xgboost]: https://cran.r-project.org/package=xgboost [XML]: https://cran.r-project.org/package=XML [xml2]: https://cran.r-project.org/package=xml2 [ShortRead]: https://bioconductor.org/packages/ShortRead/