Repana Structure

John J Aponte

2024-01-21

Introduction

Repana is an opinionated framework, meaning that the project’s structure must be predefined to determine where different types of files are stored. The structure of repana is governed by the config.yml file, and the repana::make_structure() function aids in constructing the directory layout. If no config.yml is present, make_structure() generates one.

Default structure

The default structure is established using the make_structure() function, which creates a config.yml file with predefined items for the Repana package.

default:
  dirs:
    data: _data
    functions: _functions
    handmade: handmade
    database: database
    reports: reports
    logs: logs
  clean_before_new_analysis:
    - database
    - reports
    - logs
  defaultdb:
    package: duckdb
    dbconnect: duckdb
    read_only: FALSE
  template:
    _template.txt

Section dirs:

The dirs section defines the directories that the structure should maintain. Each entry consists of a nickname for the directory and its corresponding physical location. Within programs, the get_dirs() function returns the physical location that corresponds to a given nickname.

For example, using the default definition, get_dirs("data") returns "_data". This abstraction allows program logic to remain separate from the actual physical directory names, enabling different users to use the same programs without modification, even if the physical locations differ.
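As a minimal sketch of this abstraction, the snippet below builds a file path from the nickname rather than from the physical directory name. The file name survey.csv is hypothetical and used only for illustration; the call to get_dirs("data") follows the usage described above.

```r
library(repana)

# Create config.yml (if missing) and the directory layout it describes.
make_structure()

# Resolve the nickname "data" to its physical location from config.yml
# (with the default configuration this is "_data").
data_dir <- get_dirs("data")

# Build the path from the nickname; "survey.csv" is a hypothetical file name.
input_file <- file.path(data_dir, "survey.csv")
```

Because the program refers only to the nickname, a collaborator whose config.yml maps data to another location can run the same code unchanged.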

By default, six directories are defined, each serving a specific purpose:

| Entry     | Purpose                                                        |
|-----------|----------------------------------------------------------------|
| data      | Input data to the project                                      |
| functions | Functions used in the project                                  |
| handmade  | Files created not using programs in the project                |
| database  | Database and other secondary files created by the project      |
| reports   | Reports, graphs, files and other output created by the project |
| logs      | Log of executed files                                          |

: Directories defined in config.yml

Note: The handmade directory is crucial for maintaining the spirit of reproducible analysis. While all project output should ideally stem from program actions on inputs, the handmade directory serves as a space for files modified by hand or kept for reference.

Section clean_before_new_analysis:

As mentioned earlier, the essence of reproducible analysis involves being able to reproduce project outputs with the same inputs. To ensure outputs are produced by a new analysis, it is recommended to delete existing outputs before recreating them. The clean_before_new_analysis section specifies the directories deleted before a new analysis. The make_structure() function updates the .gitignore file to exclude these directories from git version control.

WARNING: The clean_structure() function will delete all directories listed under the clean_before_new_analysis entry.

Section defaultdb:

This section defines the arguments needed to create a connection with a database using the DBI system. Multiple connections can be defined under new entries. The get_con() function establishes a connection based on the information in the config.yml file. Refer to the Database configuration Vignette for detailed instructions on setting up and using database connections.
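The following sketch shows one way a program might use the default connection. It assumes that make_structure() has already been run in the project and that calling get_con() without arguments selects the defaultdb entry; check the package documentation for the exact arguments. The DBI calls are standard.

```r
library(repana)

# Open the connection described in config.yml.
# Assumption: with no arguments, get_con() uses the defaultdb entry.
con <- get_con()

# List the tables currently available through this connection.
tables <- DBI::dbListTables(con)
print(tables)

# Close the connection when the program is done with it.
DBI::dbDisconnect(con)
```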

Section template:

If using the RStudio IDE, the package installs an addin named “Repana insert template,” which inserts a default template for program documentation. This default template can be modified, and if a different file is used, the template section informs the system of its location. See the Modifying the template vignette for details on how to use and modify the template.

Workflow

A workflow using GitHub and repana in RStudio would be:

  1. Create the project in GitHub

  2. Update the README.md file

  3. Copy the URL link of the project

  4. In RStudio, create a new project from “Version Control”, select Git, and fill in the project URL and the local directory for the project

  5. Once the project is created, run the repana::make_structure() function

Your new project is ready.

  1. Share the config.yml file with your collaborators so they can adapt it to their local conditions. The config.yml file is included in .gitignore and is not uploaded to GitHub, which allows each collaborator to have their own definition.

  2. Update the project and create new programs (e.g. 01_xxx, 02_xxx, etc.)

  3. Run the project programs using repana::master()

WARNING: By default, the _data directory is not included in the .gitignore file. Consider adding it if the _data directory contains sensitive information that should not be uploaded to GitHub. The directory can then be shared between collaborators using a different method.
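One way to exclude the data directory is to add it to .gitignore from R. The sketch below uses only base R and assumes the default physical name _data; adjust the entry if your config.yml maps data to a different location.

```r
# Add the physical data directory to .gitignore unless it is already listed.
ignore_file <- ".gitignore"
entry <- "_data/"  # default physical name; change it if config.yml differs

# Read the current entries, or start from an empty file if none exists.
existing <- if (file.exists(ignore_file)) {
  readLines(ignore_file, warn = FALSE)
} else {
  character(0)
}

# Rewrite the file with the new entry appended, only when it is missing.
if (!(entry %in% existing)) {
  writeLines(c(existing, entry), ignore_file)
}
```

Note that .gitignore only affects untracked files: data files that were already committed stay in the repository history until they are removed from it.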

For more information, see the Repana Documentation.