The goal of gcxgclab is to provide a comprehensive program for preprocessing and analysis of two dimensional gas chromatography data. It is equipped with functions for baseline correction, smoothing, peak identification, peak alignment, identification of EICs, Mass Spectra, targeted analysis, compound identification with NIST and non-targeted analysis, plus plotting and data visualization.
You can install the development version of gcxgclab from CRAN with:
install.packages("gcxgclab")
If using gcxgclab in your work, please cite us.
Gamble, Stephanie & Granger, Caroline & Mannion, Joseph. gcxgclab: An R-package for two dimensional gas chromatography preprocessing and analysis. in preparation. 2024.
Provided here is an example of the work flow to fully process and analyze an example GCxGC sample.
First, load the package.
library(gcxgclab)
Load the first file. This should be a cdf file containing the processed GCxGC sample data.
Extract data from the file and create data frames for the total ion chromatogram (TIC) data and full mass and intensity data.
<- system.file('extdata','Sample1.cdf',package='gcxgclab')
filename <- extract_data(filename) rt_frame
To visualize the data, graph a contour map of original, raw TIC data. The title may be changed to any string. The daata can be plotted in either 1 of 2 time dimensions. 2 is shown.
plot_chr(rt_frame, title='Raw Data', scale='linear', dim=2)
For the scale of this data, it is useful to visualize with a contour plot in log base 10 of the TIC data. This is done by changing the scale input from linear to log (log is the default).
plot_chr(rt_frame, title='Log Intensity', scale="log")
If desired, the phase may be shifted up or down. Shifting the phase in the negative direction (down) by 2 seconds.
<- phase_shift(rt_frame, shift=-2)
shifted plot_chr(shifted, title='Shifted')
Performing Whittaker smoothing. Lambda=20 has been chosen for the smoothing coefficient (which is also the default value). Larger values of lambda indicate more smoothing. Lambda must be chosen between 0 and 1000. Plotting in log scale.
<- smooth(rt_frame, lambda=20)
sm_frame plot_chr(sm_frame, title = 'Smoothed')
Performing baseline correction. Gamma = 0.5 (which is also the default value), indicating a moderate amount of noise to be considered baseline. Gamma must be chosen between 0 and 1. 0 results in almost no data being subtracted to the baseline, and 1 results in a high amount of the data being subtracted to the baseline. Plotting in log scale.
<- bl_corr(sm_frame, gamma=0.5)
frame plot_chr(frame, title = 'Baseline Corrected')
It is helpful to identify peaks in the samples. There are two functions to locate the peaks. One finds the top N peaks, and the second finds all peaks above a given threshold.
Finding the top 20 peaks in the sample. Then finding all peaks above an intensity of 100,000.
<- top_peaks(frame$TIC_df, N=20)
peaks <- thr_peaks(frame$TIC_df, THR=100000) thrpeaks
Now to visualize these peaks. Ploting red circles over the highest 20 peaks on the chromatogram image.
plot_peak(peaks, frame, title="Top 20 Peaks")
Plot circles at the highest 20 peaks, size indicating the intensity.
plot_peakonly(peaks, title="Top 20 Peaks")
Ploting red circles over all peaks above 100,000 threshold on the chromatogram image.
plot_peak(thrpeaks, frame, title="Peaks Above 100,000")
Plot circles at all peaks above 10,000 threshold, size indicating the intensity.
plot_peakonly(thrpeaks, title="Peaks Above 100,000")
Loading a second cdf file and performing smoothing and baseline correction for alignment to the first sample.
<- system.file('extdata','Sample2.cdf',package='gcxgclab')
filename2 <- extract_data(filename2)
rt_frame2 <- smooth(rt_frame2, lambda=20)
sm_frame2 <- bl_corr(sm_frame2, gamma=0.5)
frame2
<- top_peaks(frame2$TIC_df, N=20)
peaks2 plot_peak(peaks2, frame, title="Top 20 Peaks, Sample 2")
Loading a third cdf file and performing smoothing and baseline correction for alignment to the first sample.
<- system.file('extdata','Sample3.cdf',package='gcxgclab')
filename3 <- extract_data(filename3)
rt_frame3 <- smooth(rt_frame3, lambda=20)
sm_frame3 <- bl_corr(sm_frame3, gamma=0.5)
frame3
<- top_peaks(frame3$TIC_df, N=20)
peaks3 plot_peak(peaks3, frame, title="Top 20 Peaks, Sample 3")
Performing batch alignment to a reference sample. “Sample 1” is given as the reference sample. This keeps all peaks from reference sample, then aligns corresponding peaks from each of the other samples, returns a list of data frames for each, keeping peak height unchanged, only coordinate shifted to align with reference peak.
<- align(list(frame,frame2,frame3))
aligned_list for (i in 1:3){
print(plot_peak(aligned_list$peaks[[i]], aligned_list$data[[i]], title=paste("Aligned Sample", i)))
}
In the analysis of this data, there are many times where fitting to a Gaussian curve is used. Fitting the highest peak in the sample to a 1D Gaussian curve. The area under the curve is calculated. Plot of points in blue and fitted curve in red.
<- c(peaks$'X'[1], peaks$'Y'[1])
peak_t <- gauss_fit(frame$TIC_df, peakcoord=peak_t) gaussfit
#> Area under curve = 2208363.21987165 u^2
plot_gauss(frame$TIC_df, gaussfit[[1]])
Fitting the highest peak in the sample to a 2D Gaussian curve. The volume under the curve is calculated. Contour plot of the fitted curve.
<- gauss2_fit(frame$TIC_df, peakcoord=peak_t) gaussfit2
#> Volume under curve = 16457652.920234 u^3
plot_gauss2(frame$TIC_df, gaussfit2[[1]])
Finding extracted ion chromatogram (EIC) for the specific masses of interest 141.9274 u. EIC visualization can be done in 1 or 2 dimensions.
<- find_eic(frame, MOI=141.9274)
eic plot_eic(eic,dim=1)
plot_eic(eic,dim=2)
Performing batch EIC extraction on an input list of masses: 141.9274 u, 92.1397 u, and 78.1130 u.
<- c(141.9274, 92.1397, 78.1130)
MOIs <- batch_eic(frame, MOIs=MOIs)
eics for (i in 1:length(eics)){
print(plot_eic(eics[[i]], dim=1,
title=paste("EIC for MOI",MOIs[i])))}
Finding mass spectral (MS) data at the highest peak in the TIC data. Plotting a line graph of the MS data.
<- find_ms(frame, t_peak=peaks$'T'[1])
mz plot_ms(mz)
Plotting the Kendrick Mass Defect of the mass spectrum data compared with CH_2 (compount mass = 14.01565 for CH_2 is default, can be specified to a different ion if desired).
plot_defect(mz, compound_mass = 14.01565, title="Kendrick Mass Defect, CH_2")
Performing batch MS extraction on the highest 5 peaks in the TIC data.
<- batch_ms(frame, t_peaks = peaks$'T'[1:3])
mzs for (i in 1:length(mzs)){
print(plot_ms(mzs[[i]],
title=paste('Mass Spectrum of peak', i)))}
Identifying the compound of the highest TIC peak with the NIST MS database.
Note, this only works if the user has a NIST MS Library license already purchased. First the NIST Library is loaded, which in this case is in seven different MSP files. Then comparing the MS data to the loaded NIST Library.
<-nist_list('NIST17_01.MSP','NIST17_02.MSP','NIST17_03.MSP','NIST17_04.MSP','NIST17_05.MSP','NIST17_06.MSP','NIST17_07.MSP')
nistlist comp_nist(nistlist, mz)
#> The best NIST match is Toluene, Formula: C7H8. 86.98% match.
#> Name Formula Index Percent Match
#> 1 Toluene C7H8 2724 86.98145
#> 2 Toluene C7H8 2726 80.48633
#> 3 Toluene C7H8 2720 77.81422
#> 4 Toluene C7H8 2721 76.94814
#> 5 Toluene C7H8 2723 73.52366
#> 6 Toluene C7H8 2725 71.89045
#> 7 Spiro[2,4]hepta-4,6-diene C7H8 2744 71.35828
#> 8 Toluene C7H8 2722 70.21886
#> 9 Benzene, (butoxymethyl)- C11H16O 37871 69.14554
#> 10 Benzene, (butoxymethyl)- C11H16O 37868 69.07512
#> 10 Benzene, (butoxymethyl)- C11H16O 37868 69.07512
Performing non-targeted analysis, identifying compound with the exact mass data. The minimum threshold for peaks is set to 0.05 (which is default).
<- mass_list()
masslist non_targeted(masslist, mz, THR=0.05)
#> The best mass solution found is C7H8.
#> Compound Mass
#> 1 C7H8 92.0626
#> 2 C3H10NO2 92.07115
#> 3 C2H8N2O2 92.05858
#> 4 C2H12N2Si 92.07697
#> 5 C2H10NOSi 92.05317
#> 6 C6H6N 92.05002
#> 7 C2H10N3O 92.08239
#> 8 C4H12O2 92.08373
#> 9 C2H12Si2 92.04775
#> 10 C3H8O3 92.04734
Performing targeted analysis to identify peak areas for MOIs at given retention times. MOIs = c(141.9274, 92.1397, 92.1397, 78.113) at corresponding retention times RTs = c(3.8273333, 9.111583, 13.782750, 18.5691167). The list of data and the list of MOIs are required inputs. The RTs are not required. If left blank, the program will identify the highest point and use that retention time. There are several other optional inputs. Window size (window_size input variable) may also be given, specified for each MOI/RT pair. A tolerance (tolerance input variable) may be given to specify a tolerance around the MOIs that is to be allowed. Default is 0.005. Lastly, a string may be input (images input variable) for the name of a pdf file where images will be saved. If left blank, plots will be printed, two for each MOI/RT pair: a plot of the entire EIC showing the peak of interest, and a zoomed in plot at the RT of interest.
targeted(list(frame,frame2,frame3), MOIs = c(141.9274, 92.1397, 92.1397, 78.113), RTs = c(3.8273333, 9.111583, 13.782750, 18.5691167),images=TRUE)
#> MOI | 141.9274 92.1397 92.1397 78.113
#> RT | 3.8273333 9.111583 13.78275 18.5691167
#> ---------- | ---------- ---------- ---------- ----------
#> Sample 1 | 34.214245 181.62989 4.26959 10.63571
#> Sample 2 | 9.854254 48.751539 ND 14.603188
#> Sample 3 | 0.632451 43.737769 ND 8.710648
Examples of plots that are displayed while running the targeted function.
Contact stephanie.gamble@srnl.doe.gov
©2022 Battelle Savannah River Alliance, LLC
Notice: These data were produced by Battelle Savannah River Alliance, LLC under Contract No 89303321CEM000080 with the Department of Energy. During the period of commercialization or such other time period specified by DOE, the Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this data to reproduce, prepare derivative works, and perform publicly and display publicly, by or on behalf of the Government. Subsequent to that period, the Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this data to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so. The specific term of the license can be identified by inquiry made to Contract or DOE. NEITHER THE UNITED STATES NOR THE UNITED STATES DEPARTMENT OF ENERGY, NOR ANY OF THEIR EMPLOYEES, MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF ANY DATA, APPARATUS, PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS THAT ITS USE WORLD NOT INFRINGE PRIVATELY OWNED RIGHTS.