In this document, we will outline the design decisions that have steered the development strategies of the {cleanepi} R package, along with the rationale behind each decision and the potential advantages and disadvantages associated with them.
Data cleaning is an important phase for ensuring the efficacy of downstream analysis. The procedures entailed in the cleaning process may differ based on the data type and research objectives. Nonetheless, certain steps can be applied universally across diverse data types, irrespective of their origin.
The {cleanepi} R package is designed to offer functional programming-style data cleansing tasks. To streamline the organization of data cleaning operations, we have categorized them into distinct groups referred to as modules. These modules are based on overarching goals derived from commonly anticipated data cleaning procedures. Each module features a primary function along with additional helper functions tailored to accomplish specific tasks. It is important to note that, except for a few cases where the outcome of a helper function can impact the cleaning task, only the main function of each module is exported. This deliberate choice empowers users to execute individual cleaning tasks as needed, enhancing flexibility and usability.
At the core of {cleanepi}, the pivotal function clean_data() serves as a wrapper encapsulating all the modules, as illustrated in the figure above. This function is intended to be the primary entry point for users seeking to cleanse their data. It performs the cleaning operations requested by the user through a set of parameters that need to be explicitly defined. Furthermore, multiple cleaning operations can be performed sequentially using the "pipe" operators (|> or %>%). In addition, this package also has two surrogate functions:
scan_data(): This function enables users to assess the data types present in each column of their dataset.
print_report(): By utilizing this function, users can visualize the report generated from each applied cleaning task, facilitating transparency and understanding of the data cleaning process.
{cleanepi} is an R package crafted to clean, curate, and standardize tabular datasets, with a particular focus on epidemiological data. In the architecture of {cleanepi}, the data cleaning operations are categorized into modules, each providing a specific data cleaning task. The modules in the current version of {cleanepi} encompass the: standardization of column names; removal of empty rows, empty columns, and constant columns; detection and removal of duplicates; replacement of missing values with NA; standardization of date values; standardization of subject IDs; dictionary-based substitution; conversion of numbers written in letters; verification of the sequence of date-events; and transformation of selected columns.
By compartmentalizing these operations into modules, {cleanepi} offers users a systematic and adaptable framework to address diverse data cleaning needs, especially within the realm of epidemiological datasets.
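To make the intended workflow concrete, the sketch below chains scan_data(), clean_data(), and print_report() on a toy dataset. The toy data and the operation arguments passed to clean_data() are illustrative assumptions rather than the exact released interface; consult the package documentation for the precise argument names.

```r
library(cleanepi)

# Toy line list used throughout the sketches in this document
raw_data <- data.frame(
  study_id          = c("PS001P2", "PS002P2", "PS002P2"),
  date_of_admission = c("01/12/2020", "28/01/2021", "28/01/2021"),
  sex               = c(1, 2, 2)
)

# Quick overview of the value types found in every column
scan_data(raw_data)

# Chain the cleaning with the native pipe; the operation arguments below
# (standardize_column_names, remove_duplicates, standardize_dates) are
# illustrative and may not match the released argument names exactly.
cleaned_data <- raw_data |>
  clean_data(
    standardize_column_names = list(keep = NULL, rename = NULL),
    remove_duplicates        = list(target_columns = NULL),
    standardize_dates        = list(target_columns = "date_of_admission")
  )

# Display the report of all cleaning tasks that were applied
print_report(cleaned_data)
```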
The primary functions of the modules, as well as the core function clean_data(), accept input in the form of a data.frame or linelist. This offers flexibility for users regarding where they want to position {cleanepi} within the R package ecosystem for epidemic analysis pipelines, either to clean data before or after converting it to a linelist.
In addition to the target dataset, the core function clean_data() accepts a list of operations to be executed on the dataset. It subsequently invokes the primary functions specified for each module.
Both the primary functions of the modules and the core function clean_data() return an object of type data.frame or linelist, depending on the type of the input dataset. The report generated from all cleaning tasks is attached to this object as an attribute, which can be accessed using the attr() function in base R.
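Continuing from the sketch above, the report can be retrieved directly from the returned object; the attribute name "report" used here is an assumption based on the description above.

```r
# The cleaned object keeps its class (data.frame or linelist); the report
# from all cleaning tasks travels with it as an attribute.
report <- attr(cleaned_data, which = "report")

# One entry per cleaning task that was performed (assumed structure)
names(report)

# Formatted view of the same information
print_report(cleaned_data)
```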
In this section, we provide a detailed description of how every module is built.
1. Standardization of column names
This module is designed to standardize the style and format of column names within the target dataset, offering users the flexibility to specify a subset of:
focal columns to preserve in their original format, and
columns to be renamed, i.e. given a new name chosen by the user.
Main function: standardize_column_names()
Input: a data.frame or linelist object, along with a vector of focal column names and a vector of column names to be renamed in the form of new_name = "old_name". If not provided, all columns will undergo standardization.
Output: the input object with standardized column names.
Report:
Mode:
By incorporating the standardize_column_names()
function, {cleanepi} streamlines the process of ensuring consistency and
clarity in column naming conventions, thereby enhancing the overall
organization and readability of the dataset.
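A minimal sketch of how this module could be invoked follows; the argument names keep and rename are assumptions used for illustration.

```r
library(cleanepi)

dat <- data.frame(
  "Study ID"          = c("PS001P2", "PS002P2"),
  "Date.of.Admission" = c("01/12/2020", "28/01/2021"),
  check.names         = FALSE
)

# Keep "Study ID" as-is, rename "Date.of.Admission", and standardize the rest.
# `keep` and `rename` are assumed argument names.
dat <- standardize_column_names(
  data   = dat,
  keep   = "Study ID",
  rename = c(date_admission = "Date.of.Admission")
)

names(dat)
```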
2. Removal of empty rows and columns and constant columns
This module aims at eliminating irrelevant and redundant rows and columns, including empty rows and columns as well as constant columns.
Main function: remove_constants()
Input: a data.frame or linelist object, along with the module-specific parameters.
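A short usage sketch; the toy data below contains one empty column, one empty row, and one constant column, all of which should be dropped.

```r
library(cleanepi)

dat <- data.frame(
  study_id  = c("PS001P2", "PS002P2", NA),
  empty_col = c(NA, NA, NA),          # empty column
  country   = c("Mali", "Mali", NA)   # constant column (ignoring the empty row)
)
dat[3, ] <- NA                        # empty row

# Remove empty rows/columns and constant columns in one call
cleaned <- remove_constants(data = dat)
```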
3. Detection and removal of duplicates
This module is designed to identify and eliminate duplicated rows.
Main functions: find_duplicates(), remove_duplicates()
Input: a data.frame or linelist object, along with optional parameters: the columns to consider when looking for duplicates (use linelist_tags to consider only the tagged variables when the input is a linelist object), and an option controlling whether the detected duplicates are removed (remove = TRUE by default).
Through the remove_duplicates() function, users can streamline their dataset by eliminating redundant rows, thus enhancing data integrity and analysis efficiency.
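A sketch of the two functions in this module; target_columns is the assumed name of the argument selecting which columns define a duplicate.

```r
library(cleanepi)

dat <- data.frame(
  study_id = c("PS001P2", "PS002P2", "PS002P2"),
  sex      = c("male", "female", "female")
)

# Flag rows that are identical across the selected columns
dups <- find_duplicates(data = dat, target_columns = c("study_id", "sex"))

# Drop the duplicated rows; the action is recorded in the attached report
dat <- remove_duplicates(data = dat, target_columns = c("study_id", "sex"))
```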
4. Replacement of missing values with NA
This module aims to standardize and unify the representation of missing values within the dataset.
Main function: replace_missing_values()
Input: a data.frame or linelist object, along with: a vector of column names (if not provided, the operation is performed across all columns) and the strings that represent the missing values to be replaced with NA.
By utilizing the replace_missing_values() function, users can ensure consistency in handling missing values across their dataset, facilitating accurate analysis and interpretation of the data.
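A sketch assuming the argument names target_columns and na_strings; here the ad hoc code "-99" is converted to NA across all columns.

```r
library(cleanepi)

dat <- data.frame(
  study_id = c("PS001P2", "PS002P2", "PS003P2"),
  age      = c("25", "-99", "32")
)

# Replace the placeholder "-99" with NA everywhere; argument names are assumed
dat <- replace_missing_values(
  data           = dat,
  target_columns = NULL,   # NULL = apply to every column
  na_strings     = "-99"
)
```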
5. Standardization of date values
This module is dedicated to converting date values in character columns into Date values in ISO 8601 format, and to ensuring that all dates fall within the given timeframe.
Main function: standardize_dates()
Input: a data.frame or linelist object, along with: a vector of targeted date columns (automatically determined if not provided), the timeframe within which the dates are expected to fall, and the maximum percentage of missing (NA) values allowed in a converted column; when the percentage of missing values equals or exceeds this threshold, the original values are returned (the default value is 40%).
By employing the standardize_dates() function, users can ensure uniformity and coherence in date formats across their dataset, while also validating the temporal integrity of the data within the defined timeframe.
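A sketch assuming the argument names target_columns, timeframe, and error_tolerance (the 40% threshold described above).

```r
library(cleanepi)

dat <- data.frame(
  study_id          = c("PS001P2", "PS002P2"),
  date_of_admission = c("01/12/2020", "2021-01-28")   # mixed formats
)

# Convert to Date (ISO 8601) and keep only values inside the timeframe;
# the argument names below are assumptions.
dat <- standardize_dates(
  data            = dat,
  target_columns  = "date_of_admission",
  timeframe       = as.Date(c("2019-01-01", "2021-12-31")),
  error_tolerance = 0.4   # return the original values at >= 40% missing
)
```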
6. Standardization of subject IDs
This module is tailored to verify whether the values in the column uniquely identifying subjects adhere to a consistent format. It also offers functionality that allows users to correct inconsistent subject IDs.
Main functions: check_subject_ids(), correct_subject_ids()
Input: a data.frame or linelist object, along with the column that uniquely identifies the subjects and the parameters that define the expected ID format.
By utilizing the functions in this module, users can ensure uniformity in the format of subject IDs, facilitating accurate tracking and analysis of individual subjects within the dataset.
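A sketch assuming that the expected ID format is described through arguments such as prefix, suffix, and nchar; these argument names are assumptions.

```r
library(cleanepi)

dat <- data.frame(
  study_id = c("PS001P2", "PS002P2", "PS004"),  # the last ID breaks the pattern
  age      = c(25, 31, 48)
)

# Check that every ID matches the expected template; argument names assumed
checked <- check_subject_ids(
  data           = dat,
  target_columns = "study_id",
  prefix         = "PS",
  suffix         = "P2",
  nchar          = 7
)

# correct_subject_ids() can then be used to fix the IDs flagged in the report
```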
7. Dictionary based substitution
This module facilitates dictionary-based substitution, which involves replacing existing values with predefined ones. It replaces entries in specific columns with given values, such as substituting 1 with "male" and 2 with "female" in a gender column. It also interoperates seamlessly with the get_meta_data() function from the {readepi} R package.
Main function: clean_using_dictionary()
Input: a data.frame or linelist object, along with a data dictionary featuring the following column names: options, values, and order.
By leveraging the clean_using_dictionary() function, users can streamline and standardize the values within specific columns based on predefined mappings, enhancing consistency and accuracy in the dataset.
Note that the clean_using_dictionary()
function will
return a warning when it detects unexpected values in the target columns
from the data dictionary. The unexpected values can be added to the data
dictionary using the add_to_dictionary()
function.
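A sketch of dictionary-based substitution. The dictionary uses the column names stated above (options, values, order); the additional grp column identifying the target variable is an assumption.

```r
library(cleanepi)

dat <- data.frame(
  study_id = c("PS001P2", "PS002P2", "PS003P2"),
  gender   = c(1, 2, 1)
)

# Map coded values to labels: 1 -> "male", 2 -> "female"
dictionary <- data.frame(
  options = c("1", "2"),
  values  = c("male", "female"),
  order   = c(1L, 2L),
  grp     = "gender"   # assumed column naming the variable to recode
)

dat <- clean_using_dictionary(data = dat, dictionary = dictionary)

# Unexpected codes reported as warnings can be appended with add_to_dictionary()
```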
8. Conversion of values when necessary
This module is designed to convert numbers written in letters to
numerical values, ensuring interoperability with the
{numberize}
package.
Main function: convert_to_numeric()
Input: a data.frame or linelist object, along with: the columns to be converted and the language in which the numbers are written; the supported languages are English, French, and Spanish.
By employing the convert_to_numeric() function, users can seamlessly transform numeric representations written in letters into numerical values, ensuring compatibility with the {numberize} package and promoting accuracy in numerical analysis.
Note that convert_to_numeric()
will issue a warning for
unexpected values and return them in the report.
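A sketch assuming the argument names target_columns and lang, with English assumed here.

```r
library(cleanepi)

dat <- data.frame(
  study_id = c("PS001P2", "PS002P2", "PS003P2"),
  age      = c("twenty-five", "31", "forty")
)

# Convert numbers written in letters into numeric values; argument names assumed
dat <- convert_to_numeric(
  data           = dat,
  target_columns = "age",
  lang           = "en"   # assumed code for English
)
```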
9. Verification of the sequence of date-events
This module provides functions to verify whether the sequence of date events aligns with expectations. For instance, it can flag rows where the date of admission to the hospital precedes the individual’s date of birth.
Main function: check_date_sequence()
Input: a data.frame or linelist object, along with the target date columns that define the expected sequence of events.
By using the check_date_sequence()
function, users can
systematically validate and ensure the coherence of date sequences
within their dataset, promoting accuracy and reliability in subsequent
analyses.
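A sketch assuming that target_columns lists the date columns in the order the events are expected to occur.

```r
library(cleanepi)

dat <- data.frame(
  study_id          = c("PS001P2", "PS002P2"),
  date_of_birth     = as.Date(c("1990-05-01", "2021-06-15")),
  date_of_admission = as.Date(c("2020-12-01", "2021-01-28"))
)

# Flag rows where admission precedes birth (the second row here); the argument
# name and the ordering convention are assumptions.
checked <- check_date_sequence(
  data           = dat,
  target_columns = c("date_of_birth", "date_of_admission")
)
```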
10. Transformation of selected columns
This module is dedicated to performing various specialized operations related to epidemiological data analytics, and it currently includes the following functions:
timespan()
Input: a data.frame or linelist object, along with the user-defined parameters that specify how the time span should be computed.
By leveraging the timespan()
function, users can
efficiently compute and integrate time span information into their
epidemiological dataset based on user-defined parameters, enhancing the
analytics capabilities of the dataset.
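A sketch of computing a time span (age in years) from a date column; every argument name below is an assumption.

```r
library(cleanepi)

dat <- data.frame(
  study_id      = c("PS001P2", "PS002P2"),
  date_of_birth = as.Date(c("1990-05-01", "1985-11-20"))
)

# Add a column with the elapsed time between date_of_birth and today;
# all argument names are assumptions.
dat <- timespan(
  data             = dat,
  target_column    = "date_of_birth",
  end_date         = Sys.Date(),
  span_unit        = "years",
  span_column_name = "age_in_years"
)
```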
scan_data(): This function is designed to generate a quick summary of the dataset, offering insights into the composition of each column. It calculates the percentage of values belonging to different data types, such as character, numeric, missing, logical, and date. This summary can help analysts and data scientists understand the structure and content of the dataset at a glance.
print_report(): This function is used to display the report detailing the results of the cleaning operations executed on the dataset. It presents information about the data cleaning processes performed, such as handling missing values, correcting data types, removing duplicates, and any other transformations applied to ensure data quality and integrity.
These surrogate functions play crucial roles in the data analysis and cleaning workflow, providing valuable information and documentation about the dataset characteristics and the steps taken to prepare it for analysis or modelling.
The modules and surrogate functions will depend mainly on the following packages:
{numberize}: used for the conversion of numbers from character into numeric values.
{dplyr}: used in many ways, including filtering, column creation, and data summaries.
{magrittr}: used here for its %>% operator.
{linelist}: used to perform some operations on linelist-type input objects.
{janitor}: used here for the removal of constant data (empty rows and columns, as well as constant columns).
{matchmaker}: utilized to perform the dictionary-based cleaning.
{lubridate}: used to create, handle, and manipulate objects of type Date.
{reactable}: mainly used here to customize the data cleaning report.
{arsenal}: used in standardizing column names.
{glue}: used as a substitute for paste() and paste0() to avoid linter warnings.
{snakecase}: used in standardizing column names, transforming everything into snake case unless specified otherwise.
{withr}: utilized to handle the creation of temporary files and directories relevant for print_report().
{readr}: used to import data.
The functions will also require the other packages needed during the package development process, including:
{checkmate}, {kableExtra}, {bookdown}, {rmarkdown}, {testthat} (>= 3.0.0), {knitr}, {lintr}
There are no special requirements for contributing to {cleanepi}; please follow the package contributing guide.