Introduction to psHarmonize


library(dplyr)
library(knitr)
library(stringr)
library(tidyr)
library(glue)
library(purrr)

library(psHarmonize)

The psHarmonize package provides functions that make harmonizing multiple cohorts easier.

The main function is the harmonization() function. It takes a harmonization sheet as its input, and outputs a list of objects based on what you’ve requested.


Harmonization sheet



The harmonization sheet will serve as a set of instructions. It lists the source datasets, source variables (columns), and what modifications (if any) you will require.

head(harmonization_sheet_example) %>%
  kable()
| id_var | item | study | domain | subdomain | source_dataset | source_item | visit | code1 | code_type | coding_notes | possible_range |
|---|---|---|---|---|---|---|---|---|---|---|---|
| idvar | age | Cohort A | Demographics | Age | cohort_a | age | 1 | NA | NA | No change needed | NA |
| idvar | height | Cohort A | Demographics | Height | cohort_a | height_1 | 1 | x * 2.54 | function | Convert from inches to cm | NA |
| idvar | height | Cohort A | Demographics | Height | cohort_a | height_2 | 2 | x * 2.54 | function | Convert from inches to cm | NA |
| idvar | height | Cohort A | Demographics | Height | cohort_a | height_3 | 3 | x * 2.54 | function | Convert from inches to cm | NA |
| idvar | weight | Cohort A | Demographics | Weight | cohort_a | weight_1 | 1 | x / 2.205 | function | Converting from lbs to kg | NA |
| idvar | weight | Cohort A | Demographics | Weight | cohort_a | weight_2 | 2 | x / 2.205 | function | Converting from lbs to kg | NA |


It contains the following columns:

| Column | Description |
|---|---|
| id_var | Name of participant ID variable in source dataset. This will be renamed to ID in harmonized dataset. |
| item | New variable name |
| study | Name of cohort |
| domain | Category name |
| subdomain | Sub category name |
| source_dataset | Source dataset name |
| source_item | Existing variable name in source dataset |
| visit | Visit number |
| code1 | Code or instructions to modify original value |
| code_type | “recode category” or “function” |
| coding_notes | Notes to describe coding instructions |
| possible_range | Range of numeric values that are valid for this variable. (Example: [5, 100]) |
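
As a rough illustration of what a possible_range entry such as [5, 100] implies, the sketch below parses the bracketed text and sets out-of-range values to NA. The helper names parse_range() and apply_range() are hypothetical and not part of psHarmonize, and treating both bounds as inclusive is an assumption; the package's own handling may differ.

```r
# Hypothetical helpers (not part of psHarmonize) illustrating a range check.

# Turn text like "[5, 100]" into a numeric vector c(lower, upper).
parse_range <- function(range_text) {
  cleaned <- gsub("\\[|\\]", "", range_text)  # drop the square brackets
  as.numeric(strsplit(cleaned, ",")[[1]])     # split on the comma
}

# Set values outside the range to NA (bounds assumed to be inclusive).
apply_range <- function(x, range_text) {
  bounds <- parse_range(range_text)
  out_of_range <- !is.na(x) & (x < bounds[1] | x > bounds[2])
  x[out_of_range] <- NA
  x
}

ages <- c(3, 5, 56, 100, 120)
apply_range(ages, "[5, 100]")
#> [1]  NA   5  56 100  NA
```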


Cohort data



Three sample datasets have been provided with the psHarmonize package.

head(cohort_a) %>%
  kable()
idvar age height_1 weight_1 education height_2 weight_2 height_3 weight_3
1001 56 65.72971 159.6694 3 65.78569 159.3111 65.88767 161.0290
1002 55 65.17160 160.8041 3 63.36848 160.0522 63.47919 159.1563
1003 55 63.03661 162.1213 3 64.25128 160.5126 64.85743 160.1114
1004 56 66.49534 159.4714 1 65.85359 159.1816 64.45130 158.8120
1005 55 63.95676 160.1953 4 65.52275 159.6045 65.39563 160.3091
1006 55 64.71709 159.6868 5 65.62382 159.2520 64.63992 160.1841

head(cohort_b) %>%
  kable()
ID Age hgt_in wgt_kg edu_cat
2543 76 69.86692 69.16364 1
2544 75 69.44474 67.55916 1
2545 75 71.05661 68.81257 3
2546 76 70.20290 67.72010 1
2547 75 70.84808 69.09049 2
2548 75 71.58526 68.65528 4

head(cohort_c) %>%
  kable()
cohort_id age height_cm weight_lbs edu
1054 74 179.0202 164.4654 3
1055 73 178.8935 164.6278 1
1056 75 178.6693 163.4455 2
1057 74 178.3882 165.0523 2
1058 76 178.0217 163.8981 1
1059 75 179.2889 164.5375 3


Cohort A

Cohort A has 10,000 participants, and 3 visits. The height is measured in inches, and the weight is measured in lbs.

Cohort A’s education categories are as follows:

| Code | Description |
|---|---|
| 1 | No education |
| 2 | Completed grade school |
| 3 | Jr-High School |
| 4 | Completed High School |
| 5 | Some college |

Cohort B

Cohort B has 5,000 participants, and 1 visit. The height is measured in inches, and the weight is measured in kg.

Cohort B’s education categories are as follows:

| Code | Description |
|---|---|
| 1 | Grade school |
| 2 | High school |
| 3 | College |
| 4 | Graduate or professional school |

Cohort C

Cohort C has 7,000 participants, and 1 visit. The height is measured in cm, and the weight is measured in lbs.

Cohort C’s education categories are as follows:

| Code | Description |
|---|---|
| 1 | Grade school |
| 2 | High school |
| 3 | Associate’s degree |
| 4 | Bachelor’s degree |


Harmonization process



If we want to pool data from these disparate cohorts, we will have to convert or recode some of the values in the original datasets.

For example, since the cohorts use different units for continuous measurements (like height and weight), we will have to convert these values so they have the same units across cohorts (cm and kg, respectively). This will be handled with the function code type in the harmonization function.
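
To make the function code type concrete, here is a minimal sketch of applying a code1 string such as x * 2.54 to a vector of source values. The helper apply_function_code() is hypothetical, and evaluating the string with eval(parse()) is an assumption made for illustration; how psHarmonize evaluates code1 internally is not shown in this document.

```r
# Hypothetical helper (not part of psHarmonize): evaluate a "function"-type
# code1 string, where x stands for the vector of source values.
apply_function_code <- function(x, code) {
  eval(parse(text = code), envir = list(x = x))
}

height_in  <- c(65.72971, 65.17160)  # Cohort A, visit 1, inches
weight_lbs <- c(159.6694, 160.8041)  # Cohort A, visit 1, lbs

height_cm <- apply_function_code(height_in, "x * 2.54")    # inches to cm
weight_kg <- apply_function_code(weight_lbs, "x / 2.205")  # lbs to kg

round(height_cm, 4)
#> [1] 166.9535 165.5359
```

The inputs are the rounded values displayed for cohort_a, so the results agree with the harmonized height and weight columns only up to that rounding.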

Categorical values will have to be collapsed into comparable categories. Education will have to be recoded into groupings that appropriately account for the original values. This will be handled with the recode category code type in the harmonization function.
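
Collapsing categories amounts to a lookup from source codes to harmonized labels. The sketch below uses a base R named vector for Cohort A, with labels taken from the harmonized long dataset output shown later in this document; the mapping for code 2 does not appear in that output, so it is left out here. This is not the syntax psHarmonize expects in code1, only an illustration of the idea, and the object names are hypothetical.

```r
# Illustrative lookup (not psHarmonize syntax): Cohort A education codes
# mapped to harmonized labels. Code 2 is omitted because its harmonized
# label is not visible in the example output.
education_map <- c(
  "1" = "No education/grade school",
  "3" = "High school",
  "4" = "High school",
  "5" = "College"
)

source_education <- c(3, 3, 3, 1, 4, 5)  # first six Cohort A participants

# Index the lookup by the source code converted to character.
education <- unname(education_map[as.character(source_education)])
education
```

Any source code without an entry in the lookup comes back as NA, which makes unmapped categories easy to spot.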


Creating harmonization sheet



The harmonization sheet is the input to the harmonization function. It is essentially a set of directions on how to modify data in order to create a harmonized dataset. This modification can be in the form of a function, recode, or no modification.
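
As a sketch of what writing such a sheet by hand might look like, the example below builds two rows as a data frame: one with no modification and one with a function. The values are copied from harmonization_sheet_example; the object name my_sheet is hypothetical, and the column types are an assumption, since the source only shows the printed table.

```r
# Two sheet rows written out by hand, one per source variable and visit.
# Values are copied from the first rows of harmonization_sheet_example.
my_sheet <- data.frame(
  id_var         = c("idvar", "idvar"),
  item           = c("age", "height"),
  study          = c("Cohort A", "Cohort A"),
  domain         = c("Demographics", "Demographics"),
  subdomain      = c("Age", "Height"),
  source_dataset = c("cohort_a", "cohort_a"),
  source_item    = c("age", "height_1"),
  visit          = c(1, 1),
  code1          = c(NA, "x * 2.54"),        # NA means no modification
  code_type      = c(NA, "function"),
  coding_notes   = c("No change needed", "Convert from inches to cm"),
  possible_range = c(NA, NA),                # no range restriction
  stringsAsFactors = FALSE
)

str(my_sheet)
```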


Calling harmonization function



When the harmonization function is called, the current item, cohort, and visit are printed to the console.



harmonization_object <- harmonization(harmonization_sheet = harmonization_sheet_example, 
                          long_dataset = TRUE, 
                          wide_dataset = TRUE,
                          error_log = TRUE, 
                          source_variables = TRUE)
#> Currently on item: age; cohort: Cohort A; visit 1 / 1.
#> Currently on item: age; cohort: Cohort B; visit 1 / 1.
#> Currently on item: age; cohort: Cohort C; visit 1 / 1.
#> Currently on item: height; cohort: Cohort A; visit 1 / 3.
#> Currently on item: height; cohort: Cohort A; visit 2 / 3.
#> Currently on item: height; cohort: Cohort A; visit 3 / 3.
#> Currently on item: height; cohort: Cohort B; visit 1 / 1.
#> Currently on item: height; cohort: Cohort C; visit 1 / 1.
#> Currently on item: weight; cohort: Cohort A; visit 1 / 3.
#> Currently on item: weight; cohort: Cohort A; visit 2 / 3.
#> Currently on item: weight; cohort: Cohort A; visit 3 / 3.
#> Currently on item: weight; cohort: Cohort B; visit 1 / 1.
#> Currently on item: weight; cohort: Cohort C; visit 1 / 1.
#> Currently on item: education; cohort: Cohort A; visit 1 / 1.
#> Currently on item: education; cohort: Cohort B; visit 1 / 1.
#> Currently on item: education; cohort: Cohort C; visit 1 / 1.
#> [1] "Finished!"
#> 
#> # Harmonization status ----------------------------
#> 
#> 
#> ## Successfully harmonized ------------------------ 
#> 
#> Number of rows in harmonization sheet successfully harmonized:  
#>  16 / 16 
#> 
#> 
#> ## NOT successfully harmonized -------------------- 
#> 
#> Number of rows in harmonization sheet NOT successfully harmonized:  
#>  0 / 16 
#> 
#> 
#> # Values outside of range -------------------------
#> 
#> 
#> ## Numeric variables ------------------------------ 
#> 
#> Number of numeric rows with values set to NA:  
#>  0 / 0 
#> 
#> 
#> ## Categorical variables -------------------------- 
#> 
#> Number of categorical rows with values set to NA:  
#>  0 / 0


Extracting harmonization objects



The function returns a list of items. You can extract each data frame from the list with the $ operator by referring to it by name.

Possible items include the long dataset, the wide dataset, and the error log:


Long dataset



The long dataset will have one row per participant, visit, and cohort.


harmonized_long_dataset <- harmonization_object$long_dataset

head(harmonized_long_dataset) %>%
  kable()
| cohort | ID | visit | source_age | age | source_height | height | source_weight | weight | source_education | education |
|---|---|---|---|---|---|---|---|---|---|---|
| Cohort A | 1001 | 1 | 56 | 56 | 65.72971 | 166.9535 | 159.6694 | 72.41241 | 3 | High school |
| Cohort A | 1002 | 1 | 55 | 55 | 65.17160 | 165.5359 | 160.8041 | 72.92705 | 3 | High school |
| Cohort A | 1003 | 1 | 55 | 55 | 63.03661 | 160.1130 | 162.1213 | 73.52440 | 3 | High school |
| Cohort A | 1004 | 1 | 56 | 56 | 66.49534 | 168.8982 | 159.4714 | 72.32261 | 1 | No education/grade school |
| Cohort A | 1005 | 1 | 55 | 55 | 63.95676 | 162.4502 | 160.1953 | 72.65094 | 4 | High school |
| Cohort A | 1006 | 1 | 55 | 55 | 64.71709 | 164.3814 | 159.6868 | 72.42033 | 5 | College |

The column ID is the participant ID. If the source data is longitudinal and has multiple visits per participant, that participant ID will have multiple rows of data in the long dataset.

For example, the participants in cohort_a have multiple visits, so they will have multiple rows in the long dataset.


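# The cohort column holds the study name ('Cohort A'), not the dataset name.
# Sorting by ID and visit places each participant's visits on adjacent rows.
harmonized_long_dataset %>%
  filter(cohort == 'Cohort A') %>%
  arrange(ID, visit) %>%
  head() %>%
  kable()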
cohort ID visit source_age age source_height height source_weight weight source_education education


Wide dataset



The wide dataset will have one row per participant. The visit number will be added to the variable name as a suffix after an underscore.


harmonized_wide_dataset <- harmonization_object$wide_dataset

head(harmonized_wide_dataset) %>%
  kable()
cohort ID source_age_1 source_age_2 source_age_3 age_1 age_2 age_3 source_height_1 source_height_2 source_height_3 height_1 height_2 height_3 source_weight_1 source_weight_2 source_weight_3 weight_1 weight_2 weight_3 source_education_1 source_education_2 source_education_3 education_1 education_2 education_3
Cohort A 1001 56 NA NA 56 NA NA 65.72971 65.78569 65.88767 166.9535 167.0956 167.3547 159.6694 159.3111 161.0290 72.41241 72.24995 73.02904 3 NA NA High school NA NA
Cohort A 1002 55 NA NA 55 NA NA 65.17160 63.36848 63.47919 165.5359 160.9559 161.2371 160.8041 160.0522 159.1563 72.92705 72.58604 72.17972 3 NA NA High school NA NA
Cohort A 1003 55 NA NA 55 NA NA 63.03661 64.25128 64.85743 160.1130 163.1982 164.7379 162.1213 160.5126 160.1114 73.52440 72.79484 72.61286 3 NA NA High school NA NA
Cohort A 1004 56 NA NA 56 NA NA 66.49534 65.85359 64.45130 168.8982 167.2681 163.7063 159.4714 159.1816 158.8120 72.32261 72.19121 72.02360 1 NA NA No education/grade school NA NA
Cohort A 1005 55 NA NA 55 NA NA 63.95676 65.52275 65.39563 162.4502 166.4278 166.1049 160.1953 159.6045 160.3091 72.65094 72.38297 72.70253 4 NA NA High school NA NA
Cohort A 1006 55 NA NA 55 NA NA 64.71709 65.62382 64.63992 164.3814 166.6845 164.1854 159.6868 159.2520 160.1841 72.42033 72.22313 72.64586 5 NA NA College NA NA
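
The relationship between the long and wide layouts can be sketched with tidyr::pivot_wider() on a toy data frame. The heights are copied from the Cohort A rows above; the object names are hypothetical, and this is only an illustration of the naming pattern, not necessarily how psHarmonize builds its wide dataset.

```r
library(tidyr)

# Toy long data: one row per participant and visit.
long_example <- data.frame(
  cohort = "Cohort A",
  ID     = rep(c(1001, 1002), each = 3),
  visit  = rep(1:3, times = 2),
  height = c(166.9535, 167.0956, 167.3547,
             165.5359, 160.9559, 161.2371)
)

# One row per participant; the visit number becomes a suffix after "_".
wide_example <- pivot_wider(
  long_example,
  names_from   = visit,
  values_from  = height,
  names_prefix = "height_"
)

names(wide_example)
#> [1] "cohort"   "ID"       "height_1" "height_2" "height_3"
```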


Error log



The error log will have a status that indicates whether a specific variable was successfully harmonized.

error_log <- harmonization_object$error_log

table(error_log$completed_status)
#> 
#> Completed 
#>        16


Note: The error log will only be able to detect “processing” errors, and not “content” errors. For example, if the user enters coding instructions that are nonsensical or incorrect, but are still able to be executed, this function will not be able to detect it.
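
When some rows fail, it helps to list only those rows. The sketch below does this on a toy error log. Only the completed_status column and the value "Completed" appear in this document; the other columns, the "Not completed" value, and the object names are hypothetical stand-ins.

```r
# Toy error log. Only completed_status and the value "Completed" come from
# the real output; item, study, and "Not completed" are hypothetical.
toy_error_log <- data.frame(
  item  = c("age", "height", "education"),
  study = c("Cohort A", "Cohort B", "Cohort C"),
  completed_status = c("Completed", "Not completed", "Completed"),
  stringsAsFactors = FALSE
)

# Keep only the sheet rows that did not finish successfully.
failed_rows <- subset(toy_error_log, completed_status != "Completed")
failed_rows
```

The same subset() call could be pointed at harmonization_object$error_log, assuming the status values there follow the same pattern.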


Creating reports




Error report



The error report is an HTML report showing any issues with the harmonization process.


create_error_log_report(harmonization_object, path = './output/', file = 'Error_log.html')


Summary report



The summary report is an HTML report showing summary statistics of your harmonized dataset. The harmonization object is the input.


create_summary_report(harmonization_object = harmonization_object, path = './output/', file = 'Summary_report')


The output of the summary report should look like the following:


Summary output example.