The psHarmonize package provides functions that make harmonizing multiple cohorts easier.

The main function is harmonization(). It takes a harmonization sheet as its input and returns a list of objects based on what you’ve requested.

Harmonization sheet

The harmonization sheet will serve as a set of instructions. It lists the source datasets, source variables (columns), and what modifications (if any) you will require.

head(harmonization_sheet_example) %>%
  kable()

id_var	item	study	domain	subdomain	source_dataset	source_item	visit	code1	code_type	coding_notes	possible_range
idvar	age	Cohort A	Demographics	Age	cohort_a	age	1	NA	NA	No change needed	NA
idvar	height	Cohort A	Demographics	Height	cohort_a	height_1	1	x * 2.54	function	Convert from inches to cm	NA
idvar	height	Cohort A	Demographics	Height	cohort_a	height_2	2	x * 2.54	function	Convert from inches to cm	NA
idvar	height	Cohort A	Demographics	Height	cohort_a	height_3	3	x * 2.54	function	Convert from inches to cm	NA
idvar	weight	Cohort A	Demographics	Weight	cohort_a	weight_1	1	x / 2.205	function	Converting from lbs to kg	NA
idvar	weight	Cohort A	Demographics	Weight	cohort_a	weight_2	2	x / 2.205	function	Converting from lbs to kg	NA

It contains the following columns:

Column	Description
id_var	Name of participant ID variable in source dataset. This will be renamed to `ID` in harmonized dataset.
item	New variable name
study	Name of cohort
domain	Category name
subdomain	Sub category name
source_dataset	Source dataset name
source_item	Existing variable name in source dataset
visit	Visit number
code1	Code or instructions to modify original value
code_type	“recode category” or “function”
coding_notes	Notes to describe coding instructions
possible_range	Range of numeric values that are valid for this variable. (Example: `[5, 100]`)
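
As a rough illustration of what the possible_range column expresses, the sketch below parses a range string such as `[5, 100]` and sets values outside it to NA. This is base R written for this example only. The helper name apply_possible_range is hypothetical and is not part of psHarmonize, whose internal handling may differ.

```r
# Hypothetical helper (not part of psHarmonize): apply a range such as
# "[5, 100]" to a numeric vector, setting out-of-range values to NA.
apply_possible_range <- function(x, possible_range) {
  # Strip the brackets, split on the comma, and convert both bounds to numbers
  bounds <- as.numeric(strsplit(gsub("\\[|\\]", "", possible_range), ",")[[1]])
  # Replace non-missing values that fall outside the bounds
  x[!is.na(x) & (x < bounds[1] | x > bounds[2])] <- NA
  x
}

ages <- c(56, 55, 140, 3)  # made-up values for illustration
checked_ages <- apply_possible_range(ages, "[5, 100]")
print(checked_ages)
```

Here the two values outside the range (140 and 3) become NA, while 56 and 55 are kept.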

Cohort data

Three sample datasets have been provided with the psHarmonize package.

head(cohort_a) %>%
  kable()

idvar	age	height_1	weight_1	education	height_2	weight_2	height_3	weight_3
1001	56	65.72971	159.6694	3	65.78569	159.3111	65.88767	161.0290
1002	55	65.17160	160.8041	3	63.36848	160.0522	63.47919	159.1563
1003	55	63.03661	162.1213	3	64.25128	160.5126	64.85743	160.1114
1004	56	66.49534	159.4714	1	65.85359	159.1816	64.45130	158.8120
1005	55	63.95676	160.1953	4	65.52275	159.6045	65.39563	160.3091
1006	55	64.71709	159.6868	5	65.62382	159.2520	64.63992	160.1841


head(cohort_b) %>%
  kable()

ID	Age	hgt_in	wgt_kg	edu_cat
2543	76	69.86692	69.16364	1
2544	75	69.44474	67.55916	1
2545	75	71.05661	68.81257	3
2546	76	70.20290	67.72010	1
2547	75	70.84808	69.09049	2
2548	75	71.58526	68.65528	4


head(cohort_c) %>%
  kable()

cohort_id	age	height_cm	weight_lbs	edu
1054	74	179.0202	164.4654	3
1055	73	178.8935	164.6278	1
1056	75	178.6693	163.4455	2
1057	74	178.3882	165.0523	2
1058	76	178.0217	163.8981	1
1059	75	179.2889	164.5375	3

Cohort A

Cohort A has 10,000 participants, and 3 visits. The height is measured in inches, and the weight is measured in lbs.

Cohort A’s education categories are as follows:

Code	Description
1	No education
2	Completed grade school
3	Jr-High School
4	Completed High School
5	Some college

Cohort B

Cohort B has 5,000 participants, and 1 visit. The height is measured in inches, and the weight is measured in kg.

Cohort B’s education categories are as follows:

Code	Description
1	Grade school
2	High school
3	College
4	Graduate or professional school

Cohort C

Cohort C has 7,000 participants, and 1 visit. The height is measured in cm, and the weight is measured in lbs.

Cohort C’s education categories are as follows:

Code	Description
1	Grade school
2	High school
3	Associate’s degree
4	Bachelor’s degree

Harmonization process

If we want to be able to pool data from these disparate cohorts together we will have to convert or recode some of the values in the original datasets.

For example, since the cohorts use different units for continuous measurements (like height and weight), we will have to convert these values so they have the same units across cohorts (cm and kg, respectively). This will be handled with the function code type in the harmonization function.
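
To make the arithmetic concrete, the sketch below applies the two conversions shown in the example harmonization sheet (`x * 2.54` and `x / 2.205`) to the visit 1 values of the first Cohort A participant. Evaluating the code1 string with eval(parse()) is only an assumption about how such an expression could be applied; it is not taken from the psHarmonize source, and the helper name apply_code1 is hypothetical.

```r
# Values for participant 1001 at visit 1, copied from cohort_a above
height_in  <- 65.72971
weight_lbs <- 159.6694

# Hypothetical helper: evaluate a code1 expression with x bound to the source values
apply_code1 <- function(code1, x) {
  eval(parse(text = code1), envir = list(x = x))
}

height_cm <- apply_code1("x * 2.54", height_in)    # inches to cm
weight_kg <- apply_code1("x / 2.205", weight_lbs)  # lbs to kg

print(round(height_cm, 2))
print(round(weight_kg, 2))
```

The results are roughly 166.95 cm and 72.41 kg.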

Categorical values will have to be collapsed into comparable categories. Education will have to be recoded into groupings that appropriately account for the original values. This will be handled with the recode category code type in the harmonization function.
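
The collapsing step can be pictured as a lookup from each source code to a harmonized label. The sketch below does this for Cohort A with a named character vector in base R. The labels for codes 1, 3, 4, and 5 follow the harmonized output shown in this document; the mapping for code 2 is an assumption. This is not the syntax that the code1 column expects.

```r
# Lookup from Cohort A education codes to harmonized labels.
# Codes 1, 3, 4, and 5 match the harmonized example output; code 2 is assumed.
education_lookup <- c(
  "1" = "No education/grade school",
  "2" = "No education/grade school",  # assumption
  "3" = "High school",
  "4" = "High school",
  "5" = "College"
)

source_education <- c(3, 3, 3, 1, 4, 5)  # first six rows of cohort_a

# Index the lookup by the code (as a character) and drop the names
education <- unname(education_lookup[as.character(source_education)])
print(education)
```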

Creating harmonization sheet

The harmonization sheet is the input to the harmonization function. It is essentially a set of directions on how to modify data in order to create a harmonized dataset. Each modification can be a function, a recode, or no change at all.
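
A minimal sketch of building sheet rows by hand, assuming the sheet is an ordinary data frame with the columns listed earlier. The two rows describe Cohort C and are illustrative: the column names and source variables come from this document, but the domain labels, coding notes, and the name my_sheet are assumptions, and the rows may not match those in harmonization_sheet_example.

```r
# Illustrative harmonization sheet rows for Cohort C (values are assumptions
# modeled on the Cohort A rows shown earlier)
my_sheet <- data.frame(
  id_var         = c("cohort_id", "cohort_id"),
  item           = c("height", "weight"),
  study          = c("Cohort C", "Cohort C"),
  domain         = c("Demographics", "Demographics"),
  subdomain      = c("Height", "Weight"),
  source_dataset = c("cohort_c", "cohort_c"),
  source_item    = c("height_cm", "weight_lbs"),
  visit          = c(1, 1),
  code1          = c(NA, "x / 2.205"),   # height is already in cm
  code_type      = c(NA, "function"),
  coding_notes   = c("No change needed", "Converting from lbs to kg"),
  possible_range = c(NA, NA),
  stringsAsFactors = FALSE
)

str(my_sheet)
```

Keeping one row per item, cohort, and visit mirrors the layout of the example sheet above.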

Calling harmonization function

When the harmonization function is called, the current item, cohort, and visit are printed to the console.


harmonization_object <- harmonization(harmonization_sheet = harmonization_sheet_example, 
                          long_dataset = TRUE, 
                          wide_dataset = TRUE,
                          error_log = TRUE, 
                          source_variables = TRUE)
#> Currently on item: age; cohort: Cohort A; visit 1 / 1.
#> Currently on item: age; cohort: Cohort B; visit 1 / 1.
#> Currently on item: age; cohort: Cohort C; visit 1 / 1.
#> Currently on item: height; cohort: Cohort A; visit 1 / 3.
#> Currently on item: height; cohort: Cohort A; visit 2 / 3.
#> Currently on item: height; cohort: Cohort A; visit 3 / 3.
#> Currently on item: height; cohort: Cohort B; visit 1 / 1.
#> Currently on item: height; cohort: Cohort C; visit 1 / 1.
#> Currently on item: weight; cohort: Cohort A; visit 1 / 3.
#> Currently on item: weight; cohort: Cohort A; visit 2 / 3.
#> Currently on item: weight; cohort: Cohort A; visit 3 / 3.
#> Currently on item: weight; cohort: Cohort B; visit 1 / 1.
#> Currently on item: weight; cohort: Cohort C; visit 1 / 1.
#> Currently on item: education; cohort: Cohort A; visit 1 / 1.
#> Currently on item: education; cohort: Cohort B; visit 1 / 1.
#> Currently on item: education; cohort: Cohort C; visit 1 / 1.
#> [1] "Finished!"
#> 
#> # Harmonization status ----------------------------
#> 
#> 
#> ## Successfully harmonized ------------------------ 
#> 
#> Number of rows in harmonization sheet successfully harmonized:  
#>  16 / 16 
#> 
#> 
#> ## NOT successfully harmonized -------------------- 
#> 
#> Number of rows in harmonization sheet NOT successfully harmonized:  
#>  0 / 16 
#> 
#> 
#> # Values outside of range -------------------------
#> 
#> 
#> ## Numeric variables ------------------------------ 
#> 
#> Number of numeric rows with values set to NA:  
#>  0 / 0 
#> 
#> 
#> ## Categorical variables -------------------------- 
#> 
#> Number of categorical rows with values set to NA:  
#>  0 / 0

Extracting harmonization objects

The function will return multiple items in a list. You can extract data frames from the list with the $ operator and by referring to them by their name.

Possible items include:

long_dataset
wide_dataset
error_log

Long dataset

The long dataset will have one row per participant, visit, and cohort.


harmonized_long_dataset <- harmonization_object$long_dataset

head(harmonized_long_dataset) %>%
  kable()

cohort	ID	visit	source_age	age	source_height	height	source_weight	weight	source_education	education
Cohort A	1001	1	56	56	65.72971	166.9535	159.6694	72.41241	3	High school
Cohort A	1002	1	55	55	65.17160	165.5359	160.8041	72.92705	3	High school
Cohort A	1003	1	55	55	63.03661	160.1130	162.1213	73.52440	3	High school
Cohort A	1004	1	56	56	66.49534	168.8982	159.4714	72.32261	1	No education/grade school
Cohort A	1005	1	55	55	63.95676	162.4502	160.1953	72.65094	4	High school
Cohort A	1006	1	55	55	64.71709	164.3814	159.6868	72.42033	5	College

The column ID is the participant ID. If the source data is longitudinal and has multiple visits per participant, that participant ID will have multiple rows of data in the long dataset.

For example, the participants in Cohort A have multiple visits, so they will have multiple rows in the long dataset.


harmonized_long_dataset %>%
  # The cohort column holds the study name ('Cohort A'), not the dataset name
  filter(cohort == 'Cohort A') %>%
  arrange(ID, visit) %>%
  head() %>%
  kable()

cohort	ID	visit	source_age	age	source_height	height	source_weight	weight	source_education	education

Wide dataset

The wide dataset will have one row per participant. The visit number will be added to the variable name as a suffix after an underscore.


harmonized_wide_dataset <- harmonization_object$wide_dataset

head(harmonized_wide_dataset) %>%
  kable()

cohort	ID	source_age_1	source_age_2	source_age_3	age_1	age_2	age_3	source_height_1	source_height_2	source_height_3	height_1	height_2	height_3	source_weight_1	source_weight_2	source_weight_3	weight_1	weight_2	weight_3	source_education_1	source_education_2	source_education_3	education_1	education_2	education_3
Cohort A	1001	56	NA	NA	56	NA	NA	65.72971	65.78569	65.88767	166.9535	167.0956	167.3547	159.6694	159.3111	161.0290	72.41241	72.24995	73.02904	3	NA	NA	High school	NA	NA
Cohort A	1002	55	NA	NA	55	NA	NA	65.17160	63.36848	63.47919	165.5359	160.9559	161.2371	160.8041	160.0522	159.1563	72.92705	72.58604	72.17972	3	NA	NA	High school	NA	NA
Cohort A	1003	55	NA	NA	55	NA	NA	63.03661	64.25128	64.85743	160.1130	163.1982	164.7379	162.1213	160.5126	160.1114	73.52440	72.79484	72.61286	3	NA	NA	High school	NA	NA
Cohort A	1004	56	NA	NA	56	NA	NA	66.49534	65.85359	64.45130	168.8982	167.2681	163.7063	159.4714	159.1816	158.8120	72.32261	72.19121	72.02360	1	NA	NA	No education/grade school	NA	NA
Cohort A	1005	55	NA	NA	55	NA	NA	63.95676	65.52275	65.39563	162.4502	166.4278	166.1049	160.1953	159.6045	160.3091	72.65094	72.38297	72.70253	4	NA	NA	High school	NA	NA
Cohort A	1006	55	NA	NA	55	NA	NA	64.71709	65.62382	64.63992	164.3814	166.6845	164.1854	159.6868	159.2520	160.1841	72.42033	72.22313	72.64586	5	NA	NA	College	NA	NA
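
The relationship between the long and wide layouts can be sketched with base R's reshape(). The example below pivots a small long table (Cohort A heights copied from the table above) to wide form, with the visit number as an underscore suffix. It illustrates the naming convention only and is not a description of how psHarmonize builds its wide dataset.

```r
# Small long table: two Cohort A participants, three visits each
long <- data.frame(
  cohort = "Cohort A",
  ID     = c(1001, 1001, 1001, 1002, 1002, 1002),
  visit  = c(1, 2, 3, 1, 2, 3),
  height = c(166.9535, 167.0956, 167.3547, 165.5359, 160.9559, 161.2371)
)

# One row per participant; visit becomes a suffix on the variable name
wide <- reshape(
  long,
  idvar     = c("cohort", "ID"),
  timevar   = "visit",
  direction = "wide",
  sep       = "_"
)

print(names(wide))
```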

Error log

The error log will have a status that indicates whether a specific variable was successfully harmonized.

error_log <- harmonization_object$error_log

table(error_log$completed_status)
#> 
#> Completed 
#>        16

Note: The error log will only be able to detect “processing” errors, and not “content” errors. For example, if the user enters coding instructions that are nonsensical or incorrect, but are still able to be executed, this function will not be able to detect it.
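
Because content errors run without complaint, a quick plausibility check on the harmonized values can help. The sketch below shows a mistyped conversion (dividing by 2.54 instead of multiplying) that executes cleanly but yields implausible heights. The bounds of 100 to 250 cm and the helper name all_plausible are assumptions chosen for this example.

```r
heights_in <- c(65.72971, 65.17160, 63.03661)  # cohort_a height_1 values

correct_cm  <- heights_in * 2.54  # intended conversion
mistaken_cm <- heights_in / 2.54  # runs without error, but is wrong

# Hypothetical helper: TRUE when every non-missing value lies within the bounds
all_plausible <- function(x, lower, upper) {
  all(x >= lower & x <= upper, na.rm = TRUE)
}

print(all_plausible(correct_cm, 100, 250))
print(all_plausible(mistaken_cm, 100, 250))
```

Setting possible_range in the harmonization sheet is one way to apply a check like this during harmonization.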

Creating reports

Error report

The error report function creates an HTML report showing any issues with the harmonization process.


create_error_log_report(harmonization_object, path = './output/', file = 'Error_log.html')

Summary report

The summary report function creates an HTML report showing summary statistics of your harmonized dataset. It takes the harmonization object as its input.


create_summary_report(harmonization_object = harmonization_object, path = './output/', file = 'Summary_report')

The output of the summary report should look like the following:

[Figure: Summary output example.]

Introduction to psHarmonize