library(dplyr)
library(knitr)
library(stringr)
library(tidyr)
library(glue)
library(purrr)
library(psHarmonize)
The psHarmonize package provides functions that make harmonizing multiple cohorts easier.
The main function is harmonization(). It takes a harmonization sheet as its input and outputs a list of objects based on what you have requested.
The harmonization sheet will serve as a set of instructions. It lists the source datasets, source variables (columns), and what modifications (if any) you will require.
id_var | item | study | domain | subdomain | source_dataset | source_item | visit | code1 | code_type | coding_notes | possible_range |
---|---|---|---|---|---|---|---|---|---|---|---|
idvar | age | Cohort A | Demographics | Age | cohort_a | age | 1 | NA | NA | No change needed | NA |
idvar | height | Cohort A | Demographics | Height | cohort_a | height_1 | 1 | x * 2.54 | function | Convert from inches to cm | NA |
idvar | height | Cohort A | Demographics | Height | cohort_a | height_2 | 2 | x * 2.54 | function | Convert from inches to cm | NA |
idvar | height | Cohort A | Demographics | Height | cohort_a | height_3 | 3 | x * 2.54 | function | Convert from inches to cm | NA |
idvar | weight | Cohort A | Demographics | Weight | cohort_a | weight_1 | 1 | x / 2.205 | function | Converting from lbs to kg | NA |
idvar | weight | Cohort A | Demographics | Weight | cohort_a | weight_2 | 2 | x / 2.205 | function | Converting from lbs to kg | NA |
It contains the following columns:
Column | Description |
---|---|
id_var | Name of participant ID variable in source dataset. This will be renamed to ID in the harmonized dataset. |
item | New variable name |
study | Name of cohort |
domain | Category name |
subdomain | Sub category name |
source_dataset | Source dataset name |
source_item | Existing variable name in source dataset |
visit | Visit number |
code1 | Code or instructions to modify original value |
code_type | “recode category” or “function” |
coding_notes | Notes to describe coding instructions |
possible_range | Range of numeric values that are valid for this variable. (Example: [5, 100]) |
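As a rough illustration of the column layout, the first two rows of the example sheet can be written out by hand as a data frame. This is only a sketch in base R; the object name sheet_sketch is hypothetical, and in practice the sheet would more likely be maintained in a spreadsheet and read in.

```r
# Hypothetical hand-built harmonization sheet with the 12 columns described above.
# Values are copied from the first two rows of the example sheet.
sheet_sketch <- data.frame(
  id_var         = c("idvar", "idvar"),
  item           = c("age", "height"),
  study          = c("Cohort A", "Cohort A"),
  domain         = c("Demographics", "Demographics"),
  subdomain      = c("Age", "Height"),
  source_dataset = c("cohort_a", "cohort_a"),
  source_item    = c("age", "height_1"),
  visit          = c(1, 1),
  code1          = c(NA, "x * 2.54"),
  code_type      = c(NA, "function"),
  coding_notes   = c("No change needed", "Convert from inches to cm"),
  possible_range = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

str(sheet_sketch)
```

Each row describes one source variable for one cohort and one visit, which is why height appears once per visit in the full example sheet.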
Three sample datasets have been provided with the psHarmonize package.
idvar | age | height_1 | weight_1 | education | height_2 | weight_2 | height_3 | weight_3 |
---|---|---|---|---|---|---|---|---|
1001 | 56 | 65.72971 | 159.6694 | 3 | 65.78569 | 159.3111 | 65.88767 | 161.0290 |
1002 | 55 | 65.17160 | 160.8041 | 3 | 63.36848 | 160.0522 | 63.47919 | 159.1563 |
1003 | 55 | 63.03661 | 162.1213 | 3 | 64.25128 | 160.5126 | 64.85743 | 160.1114 |
1004 | 56 | 66.49534 | 159.4714 | 1 | 65.85359 | 159.1816 | 64.45130 | 158.8120 |
1005 | 55 | 63.95676 | 160.1953 | 4 | 65.52275 | 159.6045 | 65.39563 | 160.3091 |
1006 | 55 | 64.71709 | 159.6868 | 5 | 65.62382 | 159.2520 | 64.63992 | 160.1841 |
ID | Age | hgt_in | wgt_kg | edu_cat |
---|---|---|---|---|
2543 | 76 | 69.86692 | 69.16364 | 1 |
2544 | 75 | 69.44474 | 67.55916 | 1 |
2545 | 75 | 71.05661 | 68.81257 | 3 |
2546 | 76 | 70.20290 | 67.72010 | 1 |
2547 | 75 | 70.84808 | 69.09049 | 2 |
2548 | 75 | 71.58526 | 68.65528 | 4 |
cohort_id | age | height_cm | weight_lbs | edu |
---|---|---|---|---|
1054 | 74 | 179.0202 | 164.4654 | 3 |
1055 | 73 | 178.8935 | 164.6278 | 1 |
1056 | 75 | 178.6693 | 163.4455 | 2 |
1057 | 74 | 178.3882 | 165.0523 | 2 |
1058 | 76 | 178.0217 | 163.8981 | 1 |
1059 | 75 | 179.2889 | 164.5375 | 3 |
Cohort A has 10,000 participants, and 3 visits. The height is measured in inches, and the weight is measured in lbs.
Cohort A’s education categories are as follows:
Code | Description |
---|---|
1 | No education |
2 | Completed grade school |
3 | Jr-High School |
4 | Completed High School |
5 | Some college |
Cohort B has 5,000 participants, and 1 visit. The height is measured in inches, and the weight is measured in kg.
Cohort B’s education categories are as follows:
Code | Description |
---|---|
1 | Grade school |
2 | High school |
3 | College |
4 | Graduate or professional school |
Cohort C has 7,000 participants, and 1 visit. The height is measured in cm, and the weight is measured in lbs.
Cohort C’s education categories are as follows:
Code | Description |
---|---|
1 | Grade school |
2 | High school |
3 | Associate’s degree |
4 | Bachelor’s degree |
If we want to be able to pool data from these disparate cohorts together we will have to convert or recode some of the values in the original datasets.
For example, since the cohorts use different units for continuous measurements (like height and weight), we will have to convert these values so they have the same units across cohorts (cm and kg respectively). This will be handled with the function code type in the harmonization function.
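To make the idea concrete, here is a minimal base R sketch of applying a code1 instruction such as x * 2.54 to a source column. The helper apply_code1 is hypothetical and is not part of psHarmonize; how the package evaluates these instructions internally is not shown here.

```r
# First three Cohort A heights (inches) from the sample data
height_in <- c(65.72971, 65.17160, 63.03661)

# Instruction as written in the code1 column of the harmonization sheet
code1 <- "x * 2.54"

# Hypothetical helper: evaluate the instruction with x bound to the source values
apply_code1 <- function(x, code1) {
  eval(parse(text = code1), envir = list(x = x))
}

height_cm <- apply_code1(height_in, code1)
round(height_cm, 4)
```

The same pattern with x / 2.205 would cover the conversion from lbs to kg.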
Categorical values will have to be collapsed into comparable categories. Education will have to be recoded into groupings that appropriately account for the original values. This will be handled with the recode category code type in the harmonization function.
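Collapsing categories can be pictured as a lookup from source codes to harmonized labels. The sketch below uses a named vector in base R and is not the syntax psHarmonize expects in code1. The labels for codes 1, 3, 4, and 5 follow the harmonized output shown later in this document; the label for code 2 is an assumption.

```r
# Cohort A education codes for the first six participants in the sample data
education_src <- c(3, 3, 3, 1, 4, 5)

# Lookup from Cohort A source codes to harmonized labels
recode_map <- c(
  "1" = "No education/grade school",
  "2" = "No education/grade school",  # assumed mapping, not shown in the output
  "3" = "High school",
  "4" = "High school",
  "5" = "College"
)

# Index the lookup by the source code converted to character
education_harmonized <- unname(recode_map[as.character(education_src)])
education_harmonized
```

A source code that is missing from the lookup would come back as NA, which is one reason to list every code explicitly.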
The harmonization sheet is the input to the harmonization function. It is essentially a set of directions on how to modify data in order to create a harmonized dataset. This modification can be in the form of a function, recode, or no modification.
When the harmonization function is called, the current item, cohort, and visit are printed to the console.
harmonization_object <- harmonization(
  harmonization_sheet = harmonization_sheet_example,
  long_dataset = TRUE,
  wide_dataset = TRUE,
  error_log = TRUE,
  source_variables = TRUE
)
#> Currently on item: age; cohort: Cohort A; visit 1 / 1.
#> Currently on item: age; cohort: Cohort B; visit 1 / 1.
#> Currently on item: age; cohort: Cohort C; visit 1 / 1.
#> Currently on item: height; cohort: Cohort A; visit 1 / 3.
#> Currently on item: height; cohort: Cohort A; visit 2 / 3.
#> Currently on item: height; cohort: Cohort A; visit 3 / 3.
#> Currently on item: height; cohort: Cohort B; visit 1 / 1.
#> Currently on item: height; cohort: Cohort C; visit 1 / 1.
#> Currently on item: weight; cohort: Cohort A; visit 1 / 3.
#> Currently on item: weight; cohort: Cohort A; visit 2 / 3.
#> Currently on item: weight; cohort: Cohort A; visit 3 / 3.
#> Currently on item: weight; cohort: Cohort B; visit 1 / 1.
#> Currently on item: weight; cohort: Cohort C; visit 1 / 1.
#> Currently on item: education; cohort: Cohort A; visit 1 / 1.
#> Currently on item: education; cohort: Cohort B; visit 1 / 1.
#> Currently on item: education; cohort: Cohort C; visit 1 / 1.
#> [1] "Finished!"
#>
#> # Harmonization status ----------------------------
#>
#>
#> ## Successfully harmonized ------------------------
#>
#> Number of rows in harmonization sheet successfully harmonized:
#> 16 / 16
#>
#>
#> ## NOT successfully harmonized --------------------
#>
#> Number of rows in harmonization sheet NOT successfully harmonized:
#> 0 / 16
#>
#>
#> # Values outside of range -------------------------
#>
#>
#> ## Numeric variables ------------------------------
#>
#> Number of numeric rows with values set to NA:
#> 0 / 0
#>
#>
#> ## Categorical variables --------------------------
#>
#> Number of categorical rows with values set to NA:
#> 0 / 0
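The "Values outside of range" part of the status output relates to the possible_range column: values that fall outside the stated range are set to NA. A minimal base R sketch of that idea follows, assuming inclusive bounds and using made-up input values. It is not the package's internal code.

```r
# Example range taken from the possible_range column description
possible_range <- c(5, 100)

# Hypothetical values; the last two fall outside the range
x <- c(56, 55, 150, 3)

# Keep in-range values, set everything else to NA (bounds assumed inclusive)
in_range <- x >= possible_range[1] & x <= possible_range[2]
x_checked <- ifelse(in_range, x, NA)

# Count how many non-missing values were set to NA
n_set_to_na <- sum(is.na(x_checked) & !is.na(x))
n_set_to_na
```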
The function will return multiple items in a list. You can extract data frames from the list with the $ operator and by referring to them by their name.
Possible items include the long dataset, the wide dataset, and the error log.
The long dataset will have one row per participant, visit, and cohort.
harmonized_long_dataset <- harmonization_object$long_dataset
head(harmonized_long_dataset) %>%
kable()
cohort | ID | visit | source_age | age | source_height | height | source_weight | weight | source_education | education |
---|---|---|---|---|---|---|---|---|---|---|
Cohort A | 1001 | 1 | 56 | 56 | 65.72971 | 166.9535 | 159.6694 | 72.41241 | 3 | High school |
Cohort A | 1002 | 1 | 55 | 55 | 65.17160 | 165.5359 | 160.8041 | 72.92705 | 3 | High school |
Cohort A | 1003 | 1 | 55 | 55 | 63.03661 | 160.1130 | 162.1213 | 73.52440 | 3 | High school |
Cohort A | 1004 | 1 | 56 | 56 | 66.49534 | 168.8982 | 159.4714 | 72.32261 | 1 | No education/grade school |
Cohort A | 1005 | 1 | 55 | 55 | 63.95676 | 162.4502 | 160.1953 | 72.65094 | 4 | High school |
Cohort A | 1006 | 1 | 55 | 55 | 64.71709 | 164.3814 | 159.6868 | 72.42033 | 5 | College |
The column ID is the participant ID. If the source data is longitudinal and has multiple visits per participant, that participant ID will have multiple rows of data in the long dataset.
For example, the participants in cohort_a have multiple visits, so they will have multiple rows in the long dataset.
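One way to see this is to filter the long dataset down to a single participant. The sketch below builds a small stand-in data frame by hand (the name long_sketch is hypothetical) using harmonized heights for participants 1001 and 1002 from this document; with the real output you would filter harmonized_long_dataset the same way.

```r
# Hand-built stand-in for a slice of the long dataset (Cohort A, heights in cm)
long_sketch <- data.frame(
  cohort = "Cohort A",
  ID     = c(1001, 1001, 1001, 1002, 1002, 1002),
  visit  = c(1, 2, 3, 1, 2, 3),
  height = c(166.9535, 167.0956, 167.3547, 165.5359, 160.9559, 161.2371)
)

# All rows belonging to participant 1001: one row per visit
participant_rows <- long_sketch[long_sketch$ID == 1001, ]
participant_rows
```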
The wide dataset will have one row per participant. The visit number will be added to the variable name as a suffix after an underscore.
harmonized_wide_dataset <- harmonization_object$wide_dataset
head(harmonized_wide_dataset) %>%
kable()
cohort | ID | source_age_1 | source_age_2 | source_age_3 | age_1 | age_2 | age_3 | source_height_1 | source_height_2 | source_height_3 | height_1 | height_2 | height_3 | source_weight_1 | source_weight_2 | source_weight_3 | weight_1 | weight_2 | weight_3 | source_education_1 | source_education_2 | source_education_3 | education_1 | education_2 | education_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cohort A | 1001 | 56 | NA | NA | 56 | NA | NA | 65.72971 | 65.78569 | 65.88767 | 166.9535 | 167.0956 | 167.3547 | 159.6694 | 159.3111 | 161.0290 | 72.41241 | 72.24995 | 73.02904 | 3 | NA | NA | High school | NA | NA |
Cohort A | 1002 | 55 | NA | NA | 55 | NA | NA | 65.17160 | 63.36848 | 63.47919 | 165.5359 | 160.9559 | 161.2371 | 160.8041 | 160.0522 | 159.1563 | 72.92705 | 72.58604 | 72.17972 | 3 | NA | NA | High school | NA | NA |
Cohort A | 1003 | 55 | NA | NA | 55 | NA | NA | 63.03661 | 64.25128 | 64.85743 | 160.1130 | 163.1982 | 164.7379 | 162.1213 | 160.5126 | 160.1114 | 73.52440 | 72.79484 | 72.61286 | 3 | NA | NA | High school | NA | NA |
Cohort A | 1004 | 56 | NA | NA | 56 | NA | NA | 66.49534 | 65.85359 | 64.45130 | 168.8982 | 167.2681 | 163.7063 | 159.4714 | 159.1816 | 158.8120 | 72.32261 | 72.19121 | 72.02360 | 1 | NA | NA | No education/grade school | NA | NA |
Cohort A | 1005 | 55 | NA | NA | 55 | NA | NA | 63.95676 | 65.52275 | 65.39563 | 162.4502 | 166.4278 | 166.1049 | 160.1953 | 159.6045 | 160.3091 | 72.65094 | 72.38297 | 72.70253 | 4 | NA | NA | High school | NA | NA |
Cohort A | 1006 | 55 | NA | NA | 55 | NA | NA | 64.71709 | 65.62382 | 64.63992 | 164.3814 | 166.6845 | 164.1854 | 159.6868 | 159.2520 | 160.1841 | 72.42033 | 72.22313 | 72.64586 | 5 | NA | NA | College | NA | NA |
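The suffix convention can be reproduced with base R's reshape() on a small long-format data frame. This is only a sketch of the long-to-wide relationship, using a hand-built stand-in (long_sketch is a hypothetical name); it does not claim to be how psHarmonize builds its wide dataset.

```r
# Hand-built long-format stand-in (Cohort A, harmonized heights in cm)
long_sketch <- data.frame(
  cohort = "Cohort A",
  ID     = c(1001, 1001, 1001, 1002, 1002, 1002),
  visit  = c(1, 2, 3, 1, 2, 3),
  height = c(166.9535, 167.0956, 167.3547, 165.5359, 160.9559, 161.2371)
)

# One row per participant; the visit number becomes a suffix after an underscore
wide_sketch <- reshape(
  long_sketch,
  idvar     = c("cohort", "ID"),
  timevar   = "visit",
  direction = "wide",
  sep       = "_"
)

names(wide_sketch)
```

Variables that were only collected at one visit end up with NA in the columns for the other visits, as with age_2 and age_3 in the table above.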
The error log will have a status that indicates whether a specific variable was successfully harmonized.
Note: The error log will only be able to detect “processing” errors, and not “content” errors. For example, if the user enters coding instructions that are nonsensical or incorrect, but are still able to be executed, this function will not be able to detect it.
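The distinction between "processing" and "content" errors can be illustrated with a small base R sketch. The helper run_instruction and its status labels are hypothetical and are not the package's error log; the point is only that an instruction which fails to run can be caught, while one that runs but is wrong cannot.

```r
# Hypothetical helper: try to evaluate an instruction and record whether it ran
run_instruction <- function(x, code1) {
  tryCatch(
    list(status = "Completed",
         value  = eval(parse(text = code1), envir = list(x = x))),
    error = function(e) list(status = "Failed", value = NULL)
  )
}

# Processing error: the instruction cannot be parsed, so the failure is caught
bad_syntax <- run_instruction(65, "x * * 2.54")

# Content error: the instruction runs, but the factor is wrong for inches to cm
wrong_factor <- run_instruction(65, "x * 25.4")

bad_syntax$status
wrong_factor$status
```

Because the second instruction executes without error, only a review of the resulting values (for example, via possible_range or summary statistics) would reveal the mistake.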
The error report will create an HTML report showing any issues with the harmonization process.
The summary report will create an HTML report showing summary statistics of your harmonized dataset. The harmonization object will be the input.
create_summary_report(harmonization_object = harmonization_object, path = './output/', file = 'Summary_report')
The output of the summary report should look like the following: