1. Introduction to Dataverse

2024-05-12

The dataverse package is the official R client for Dataverse 4 data repositories. The package enables data search, retrieval, and deposit with any Dataverse installation, thus allowing R users to integrate public data sharing into the reproducible research workflow.

In addition to this introduction, the package contains additional vignettes covering:

They can be accessed from CRAN or from within R using vignettes(package = "dataverse").

Installation

The dataverse client package can be installed from CRAN, and you can find the latest development version and report any issues on GitHub:

remotes::install_github("iqss/dataverse-client-r")
library("dataverse")

Terminology

Dataverse has some terminology that is worth quickly reviewing before showing how to work with Dataverse in R. Dataverse is an application that can be installed in many places. As a result, dataverse can work with any installation but you need to specify which installation you want to work with. This can be set by default with an environment variable, DATAVERSE_SERVER:

library("dataverse")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

This should be the Dataverse server, without the “https” prefix or the “/api” URL path, etc. The package attempts to compensate for any malformed values, though.

Within a given Dataverse installation, organizations or individuals can create objects that are also called “Dataverses”. These Dataverses can then contain other dataverses, which can contain other dataverses, and so on. They can also contain datasets which in turn contain files. You can think of Harvard’s Dataverse as a top-level installation, where an institution might have a dataverse that contains a subsidiary dataverse for each researcher at the organization, who in turn publishes all files relevant to a given study as a dataset.

Get

To retrieve a data file, we need to investigate the dataset being returned and look at what files it contains using a variety of functions:

get_dataset()
dataset_files()
get_file_metadata()
get_file()
get_dataframe_by_name()

The most practical of these is likely get_dataframe_by_name() which imports the object directly as a dataframe. get_file() is more primitive, and calls a raw vector.

Recall that, because datasets in Dataverse are a collection of files rather than a single csv file, for example, the get_dataset() function does not return data but rather information about a Dataverse dataset.

The download vignette describes this functionality in more detail.

Upload and Maintain

For “native” Dataverse features (such as user account controls) or to create and publish a dataset, you will need an API key linked to a Dataverse installation account. Instructions for obtaining an account and setting up an API key are available in the Dataverse User Guide. (Note: if your key is compromised, it can be regenerated to preserve security.) Once you have an API key, this should be stored as an environment variable called DATAVERSE_KEY. It can be set within R using:

Sys.setenv("DATAVERSE_KEY" = "examplekey12345")

where examplekey12345 should be replaced with your own key.

With that set, you can easily create a new dataverse, create a dataset within that dataverse, push files to the dataset, and release it, using functions such as

initiate_sword_dataset()
add_dataset_file()
publish_dataset()

As of dataverse version 0.3.0, we recommended the Python client (https://github.com/gdcc/pyDataverse) for these upload and maintenance functions.