---
title: "ctxR: Chemical API"
author: US EPA's Center for Computational Toxicology and Exposure ccte@epa.gov
output:
  rmarkdown::html_vignette:
    fig_width: 7
    fig_height: 6
params:
  my_css: css/rmdformats.css
vignette: >
  %\VignetteIndexEntry{2. Chemical}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{css, code = readLines(params$my_css), hide=TRUE, echo = FALSE}
```
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(httptest)
start_vignette("2")
```
```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
if (!library(ctxR, logical.return = TRUE)) {
  devtools::load_all()
}
old_options <- options("width")
```
```{r, echo=FALSE, warning=FALSE}
# Used to visualize data in a variety of plot designs
library(ggplot2)
library(gridExtra)
```
```{r setup-print, echo = FALSE}
# Redefining the knit_print method to truncate character values to 25 characters
# in each column and to truncate the columns in the print call to prevent
# wrapping tables with several columns.
#library(ctxR)
knit_print.data.table <- function(x, ...) {
  y <- data.table::copy(x)
  y <- y[, lapply(.SD, function(t) {
    if (is.character(t)) {
      t <- strtrim(t, 25)
    }
    return(t)
  })]
  print(y, trunc.cols = TRUE)
}
registerS3method(
  "knit_print", "data.table", knit_print.data.table,
  envir = asNamespace("knitr")
)
```
## Introduction
In this vignette, the [CTX Chemical API](https://api-ccte.epa.gov/docs/chemical.html) is explored.
The foundation of toxicology, toxicokinetics, and exposure is embedded in the physics and chemistry of chemical-biological interactions. The accurate characterization of chemical structure linked to commonly used identifiers, such as names and Chemical Abstracts Service Registry Numbers (CASRNs), is essential to support both predictive modeling of the data and the dissemination and application of the data for chemical safety decisions.
With cheminformatics as the backbone for research efforts, sources of available data through the CTX Chemical API include:
- Chemical structures, nomenclature, synonyms, IDs, list associations, physicochemical property, environmental fate and transport data from the Distributed Structure-Searchable Toxicity ([DSSTox](https://www.epa.gov/comptox-tools/distributed-structure-searchable-toxicity-dsstox-database)) database. DSSTox substance identifiers (DTXSIDs) support linking chemical information to a specific chemical across a variety of EPA chemical resources. For early references, see [(Richard, A. et al. 2002)](https://doi.org/10.1016/S0027-5107(01)00289-5), [(Richard, A. et al. 2006)](https://files.toxplanet.com/cpdb/pdfs/structure_tox_on_web.pdf), and [(Richard, A. et al. 2008)](https://doi.org/10.1080/15376510701857452).
- Predictions from the Toxicity Estimation Software Tool ([TEST](https://www.epa.gov/comptox-tools/toxicity-estimation-software-tool-test)) suite of QSAR models. For early references, see [(Martin, T. et al. 2001)](https://pubs.acs.org/doi/10.1021/tx0155045), [(Martin, T. et al. 2007)](https://doi.org/10.1080/15376510701857353), and [(Young, D. et al. 2008)](https://doi.org/10.1002/qsar.200810084).
More information on Chemicals and Chemistry Data can be found here: .
::: {.noticebox data-latex=""}
**NOTE:** Please see the introductory vignette for an overview of the *ctxR* package and initial setup instructions, including API key storage.
:::
Several ctxR functions can be used to access the CTX Chemical API data, as described in the following sections. Tables output in each example have been filtered to display only the first few rows of data.
# Chemical Details Resource
## Get chemical data
`get_chemical_details()` retrieves chemical detail data using either a DTXSID or a DTXCID as the chemical identifier. The additional parameter "projection" determines the type of data returned. Examples for each identifier are provided below:
### By DTXSID
```{r ctxR dtxsid data chemical, message=FALSE, eval=FALSE}
chemical_details_by_dtxsid <- get_chemical_details(DTXSID = 'DTXSID7020182')
```
### By DTXCID
```{r ctxR dtxcid data chemical, message=FALSE, eval=FALSE}
chemical_details_by_dtxcid <- get_chemical_details(DTXCID = 'DTXCID30182')
```
### By Batch Search
```{r ctxR batch data chemical, message=FALSE, eval=FALSE}
vector_dtxsid <- c("DTXSID7020182", "DTXSID9020112", "DTXSID8021430")
chemical_details_by_batch_dtxsid <- get_chemical_details_batch(DTXSID = vector_dtxsid)
vector_dtxcid <- c("DTXCID30182", "DTXCID801430", "DTXCID90112")
chemical_details_by_batch_dtxcid <- get_chemical_details_batch(DTXCID = vector_dtxcid)
```
# PubChem Link to GHS classification
## GHS classification
`check_existence_by_dtxsid()` checks if the supplied DTXSID is valid and returns a URL for additional information on the chemical in the case of a valid DTXSID.
### By DTXSID
```{r ctxr dtxsid check, message=FALSE, eval=FALSE}
dtxsid_check_true <- check_existence_by_dtxsid(DTXSID = 'DTXSID7020182')
dtxsid_check_false <- check_existence_by_dtxsid(DTXSID = 'DTXSID7020182f')
```
### By Batch Search
```{r ctxr dtxsid check batch, message=FALSE, eval=FALSE}
vector_dtxsid_and_non_dtxsid <- c('DTXSID7020182F', 'DTXSID7020182', 'DTXSID0020232F')
dtxsid_checks <- check_existence_by_dtxsid_batch(DTXSID = vector_dtxsid_and_non_dtxsid)
```
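Because each existence check is a request to the API, it can be convenient to screen out obviously malformed identifiers locally first. The following is a minimal sketch and not part of ctxR: the helper name `looks_like_dtxsid()` is hypothetical, and the pattern (the prefix `DTXSID` followed only by digits) is an assumption inferred from the identifiers used in this vignette rather than an official specification.

```r
# Hypothetical helper (not a ctxR function): TRUE when a string is the prefix
# "DTXSID" followed only by digits. The pattern is an assumption based on the
# identifiers shown in this vignette.
looks_like_dtxsid <- function(x) {
  grepl("^DTXSID[0-9]+$", x)
}

candidates <- c('DTXSID7020182F', 'DTXSID7020182', 'DTXSID0020232F')

# Keep only the syntactically plausible identifiers.
well_formed <- candidates[looks_like_dtxsid(candidates)]
well_formed
```

This screen only catches syntactic problems. A well-formed identifier may still be absent from the database, so `check_existence_by_dtxsid()` remains the authoritative check.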
# Chemical Property Resource
`get_chemical_by_property_range()` retrieves data for chemicals that have a specified property within the input range.
```{r ctxR property range chemical, message=FALSE, eval=FALSE}
chemical_by_property_range <- get_chemical_by_property_range(start = 1.311,
                                                             end = 1.313,
                                                             property = 'Density')
```
`get_chem_info()` retrieves specific chemical information for an input chemical. By default this includes both experimental and predicted values; supplying "experimental" or "predicted" to the `type` parameter restricts the results to that type.
```{r ctxR info chemical, message=FALSE, eval=FALSE}
chemical_info <- get_chem_info(DTXSID = 'DTXSID7020182')
```
# Chemical Fate Resource
`get_fate_by_dtxsid()` retrieves chemical fate data.
```{r ctxR fate data chemical, message=FALSE, eval=FALSE}
fate_by_dtxsid <- get_fate_by_dtxsid(DTXSID = 'DTXSID7020182')
```
# Chemical Search Resource
Chemicals can be searched using string values. Examples of each search type are provided below:
## By starting value
```{r ctxR starting value chemical, message=FALSE, eval=FALSE}
search_starts_with <- chemical_starts_with(word = 'DTXSID70201')
```
## By exact value
```{r ctxR exact value chemical, message=FALSE, eval=FALSE}
search_exact <- chemical_equal(word = 'DTXSID7020182')
```
## By substring value
```{r ctxR substring value chemical, message=FALSE, eval=FALSE}
search_contains <- chemical_contains(word = 'DTXSID702018')
```
## Subset for MS-Ready Structures
MS-Ready [(McEachran, A. et al. 2018)](https://doi.org/10.1186/s13321-018-0299-2) data can be retrieved using a variety of input information. Examples for each are provided below:
### By Mass Range
```{r ctxR mass range ms ready chemical, message=FALSE, eval=FALSE}
msready_by_mass <- get_msready_by_mass(start = 200.9,
                                       end = 200.95)
```
### By Chemical Formula
```{r ctxR chemical formula ms ready chemical, message=FALSE, eval=FALSE}
msready_by_formula <- get_msready_by_formula(formula = 'C16H24N2O5S')
```
### By DTXCID
```{r ctxR dtxcid ms ready chemical, message=FALSE, eval=FALSE}
msready_by_dtxcid <- get_msready_by_dtxcid(DTXCID = 'DTXCID30182')
```
# List Resource
There are several lists of chemicals one can access using the [(CCD list search)](https://comptox.epa.gov/dashboard/chemical-lists). These can be filtered by list type, list name, or inclusion of a specific chemical.
## Get all list types
```{r ctxR types of chemical lists, message=FALSE, eval=FALSE}
get_all_list_types()
```
## All lists by type
```{r ctxR all list types chemical, message=FALSE, eval=FALSE}
chemical_lists_by_type <- get_chemical_lists_by_type(type = 'federal')
```
## List by name
```{r ctxR list by name chemical, message=FALSE, eval=FALSE}
public_chemical_list_by_name <- get_public_chemical_list_by_name(listname = 'CCL4')
```
## Lists containing a specific chemical
`get_lists_containing_chemical()` retrieves a list of names of chemical lists, each of which contains the specified chemical.
```{r ctxR lists containing chemical, message=FALSE, eval=FALSE}
lists_containing_chemical <- get_lists_containing_chemical(DTXSID = 'DTXSID7020182')
```
## DTXSIDs for chemical list and starting value
`get_chemicals_in_list_start()` retrieves a list of DTXSIDs for a given starting character string in a specified list of chemicals.
```{r ctxR chemicals-in-list-start, message=FALSE, eval=FALSE}
chemicals_in_ccl4_start <- get_chemicals_in_list_start(list_name = 'CCL4', word = 'Bi')
```
## DTXSIDs for chemical list and exact value
`get_chemicals_in_list_exact()` retrieves a list of DTXSIDs matching exactly a given character string in a specified list of chemicals.
```{r ctxR chemicals-in-list-exact, message=FALSE, eval=FALSE}
chemicals_in_biosolids2021_exact <- get_chemicals_in_list_exact(list_name = 'BIOSOLIDS2021', word = 'Bisphenol A')
```
## DTXSIDs for chemical list and containing value
`get_chemicals_in_list_contain()` retrieves a list of DTXSIDs that contain a given character string in a specified list of chemicals.
```{r ctxR chemicals-in-list-contain, message=FALSE, eval=FALSE}
chemicals_in_ccl4_contain <- get_chemicals_in_list_contain(list_name = 'CCL4', word = 'Bis')
```
## Chemicals in a specific list
`get_chemicals_in_list()` retrieves the specific chemical information for each chemical contained in the specified list.
```{r ctxR chemical in list chemical, message=FALSE, eval=FALSE}
chemicals_in_list <- get_chemicals_in_list(list_name = 'CCL4')
```
# Chemical File Resource
There are mrv, mol, and image files that can be accessed using either the DTXSID or DTXCID. Examples are provided below:
## Get mrv by DTXSID or DTXCID
`get_chemical_mrv()` retrieves mrv file information for a chemical specified either by DTXSID or DTXCID.
```{r ctxR mrv by dtxsid dtxcid chemical, message=FALSE, eval=FALSE}
chemical_mrv_by_dtxsid <- get_chemical_mrv(DTXSID = 'DTXSID7020182')
chemical_mrv_by_dtxcid <- get_chemical_mrv(DTXCID = 'DTXCID30182')
```
## Get mol by DTXSID or DTXCID
`get_chemical_mol()` retrieves mol file information for a chemical specified either by DTXSID or DTXCID.
```{r ctxR mol by dtxsid dtxcid chemical, message=FALSE, eval=FALSE}
chemical_mol_by_dtxsid <- get_chemical_mol(DTXSID = 'DTXSID7020182')
chemical_mol_by_dtxcid <- get_chemical_mol(DTXCID = 'DTXCID30182')
```
## Get structure image by DTXSID, DTXCID, or SMILES
`get_chemical_image()` retrieves image file information for a chemical specified by DTXSID, DTXCID, or SMILES. To visualize the returned array of image information, the user may use either the `png::writePNG()` or `countcolors::plotArrayAsImage()` functions, among many choices.
```{r ctxR image by dtxsid dtxcid chemical, message=FALSE, eval=FALSE}
chemical_image_by_dtxsid <- get_chemical_image(DTXSID = 'DTXSID7020182')
chemical_image_by_dtxcid <- get_chemical_image(DTXCID = 'DTXCID30182')
chemical_image_by_smiles <- get_chemical_image(SMILES = 'CC(C)(C1=CC=C(O)C=C1)C1=CC=C(O)C=C1')
countcolors::plotArrayAsImage(chemical_image_by_dtxsid)
countcolors::plotArrayAsImage(chemical_image_by_dtxcid)
countcolors::plotArrayAsImage(chemical_image_by_smiles)
```
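As a complement to plotting, the image array can also be written to disk with `png::writePNG()`. The following is a minimal sketch using a toy array in place of an API result. It assumes that the value returned by `get_chemical_image()` is a numeric height x width x channel array with values in [0, 1], the layout that `png::writePNG()` expects; the output file name is arbitrary.

```r
library(png)

# Toy stand-in for a returned image: a 20 x 30 RGB array with values in [0, 1].
# The real return value of get_chemical_image() is assumed to have this layout.
toy_image <- array(0, dim = c(20, 30, 3))
toy_image[, , 1] <- 1  # fill the red channel

# Write the array to a temporary PNG file.
out_file <- tempfile(fileext = '.png')
png::writePNG(toy_image, target = out_file)

# Read the file back to confirm the round trip preserved the dimensions.
dim(png::readPNG(out_file))
```

In practice, `toy_image` would be replaced by an object such as `chemical_image_by_dtxsid`, provided it has the assumed layout.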
# Chemical Synonym Resource
`get_chemical_synonym()` retrieves synonyms for the specified chemical.
```{r ctxR synonym by dtxsid chemical, message=FALSE, eval=FALSE}
chemical_synonym <- get_chemical_synonym(DTXSID = 'DTXSID7020182')
```
# Example Use Case: Comparing Physico-chemical Properties Across Chemical Lists
The fourth Drinking Water Contaminant Candidate List (CCL4) is a set of chemicals that "...are not subject to any proposed or promulgated national primary drinking water regulations, but are known or anticipated to occur in public water systems...." Moreover, this list "...was announced on November 17, 2016. The CCL 4 includes 97 chemicals or chemical groups and 12 microbial contaminants...." The National-Scale Air Toxics Assessments (NATA) is "... EPA's ongoing comprehensive evaluation of air toxics in the United States... a state-of-the-science screening tool for State/Local/Tribal agencies to prioritize pollutants, emission sources and locations of interest for further study in order to gain a better understanding of risks... use general information about sources to develop estimates of risks which are more likely to overestimate impacts than underestimate them...."
These lists can be found in the CCD at [CCL4](https://comptox.epa.gov/dashboard/chemical-lists/CCL4) with additional information at [CCL4 information](https://www.epa.gov/ccl/contaminant-candidate-list-4-ccl-4-0) and [NATADB](https://comptox.epa.gov/dashboard/chemical-lists/NATADB) with additional information at [NATA information](https://www.epa.gov/national-air-toxics-assessment). The quotes from the previous paragraph were excerpted from list detail descriptions found using the CCD links.
In this example use case, physico-chemical property data will be compared between a water contaminant priority list and an air toxics list. Note that the following code chunks use the `data.table` object, which is an extension of the `data.frame` object and has slightly different syntax. For more information, please refer to [data.table](https://CRAN.R-project.org/package=data.table).
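As a brief orientation to that syntax, the sketch below uses a small table of made-up values (not data returned by the API) whose column names mirror those used later in this vignette. It illustrates the general `DT[i, j, by]` form: `i` filters rows, `j` computes on columns, and `by` defines groups.

```r
library(data.table)

# Toy values for illustration only; these are not data returned by the API.
toy <- data.table(
  dtxsid     = c('A', 'A', 'B', 'B'),
  propType   = c('experimental', 'predicted', 'experimental', 'predicted'),
  propertyId = 'boiling-point',
  value      = c(100, 110, 200, 220)
)

toy[propType == 'predicted']                    # i: keep matching rows
toy[, .(Mean = mean(value))]                    # j: a single summary row
toy[, .(Mean = mean(value)), by = .(propType)]  # by: one summary row per group

# `:=` adds or updates a column by reference, without copying the table.
toy[, value_kelvin := value + 273.15]
```

The same three patterns (row filter, summary in `j`, grouping with `by`) and the `:=` operator account for nearly all of the data.table code in this use case.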
## Obtain Lists of Chemicals
First, confirm the chemical lists to query.
```{r}
options(width = 100)
ccl4_information <- get_public_chemical_list_by_name('CCL4')
print(ccl4_information, trunc.cols = TRUE)
natadb_information <- get_public_chemical_list_by_name('NATADB')
print(natadb_information, trunc.cols = TRUE)
```
Next, retrieve the list of chemicals associated with each list.
```{r}
ccl4 <- get_chemicals_in_list('ccl4')
ccl4 <- data.table::as.data.table(ccl4)
natadb <- get_chemicals_in_list('NATADB')
natadb <- data.table::as.data.table(natadb)
```
We examine the dimensions of the data, the column names, and display a single row for illustrative purposes.
```{r, eval=FALSE}
dim(ccl4)
dim(natadb)
colnames(ccl4)
head(ccl4, 1)
```
## Batch Search Physico-chemical Property Data by DTXSID for Chemical Lists
Next, physico-chemical properties for all chemicals in each list can be retrieved. The function `get_chem_info_batch()` will be used to batch search for a list of DTXSIDs.
```{r}
ccl4_phys_chem <- get_chem_info_batch(ccl4$dtxsid)
natadb_phys_chem <- get_chem_info_batch(natadb$dtxsid)
```
Observe that this returns a single data.table for each query, and the data.table contains the physico-chemical properties available from the CompTox Chemicals Dashboard for each chemical in the query. Note that a warning message was triggered, `Warning: Setting type to ''!`, which indicates that the parameter `type` was not given a value. A default value is set within the function, and more information can be found in the associated documentation. We examine the set of physico-chemical properties for the first chemical in CCL4.
Before any deeper analysis, consider the dimensions of the data and the column names.
```{r, eval=FALSE}
dim(ccl4_phys_chem)
colnames(ccl4_phys_chem)
```
Next, we display the unique values for the columns `propertyId` and `propType`.
```{r}
ccl4_phys_chem[, unique(propertyId)]
ccl4_phys_chem[, unique(propType)]
```
Let's explore this further by examining the mean of the "boiling-point" and "melting-point" data.
```{r}
ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value))]
ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value)),
               by = .(propType)]
ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value))]
ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value)),
               by = .(propType)]
```
These results tell us about some of the reported physico-chemical properties of the data sets.
The mean "boiling-point" is 252.6593 degrees Celsius for CCL4, with mean values of 250.5943 and 253.8196 for experimental and predicted, respectively. The mean "melting-point" is 34.91613 degrees Celsius for CCL4, with mean values of 23.18876 and 49.99417 for experimental and predicted, respectively.
To explore **all** the values of the physico-chemical properties and calculate their means, we can use the following procedure. First we look at all the physico-chemical properties individually, then group them by each property ("boiling-point", "melting-point", etc.), and then additionally group those by property type ("experimental" vs "predicted"). In the grouping, we look at the columns `value`, `unit`, `propertyId`, and `propType`. We also demonstrate how to take the mean of the values for each grouping, using the chemical identifier 'DTXSID1037567' for this example, the 25th chemical in CCL4.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
head(ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], ])
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(propType, value, unit),
               by = .(propertyId)]
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(value, unit),
               by = .(propertyId, propType)]
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(Mean_value = sapply(.SD, mean)),
               by = .(propertyId, unit), .SDcols = c("value")]
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(Mean_value = sapply(.SD, mean)),
               by = .(propertyId, unit, propType),
               .SDcols = c("value")][order(propertyId)]
```
## Review Physico-Chemical Properties Across Chemical Lists
We explore the differences in mean predicted and experimental values for a variety of physico-chemical properties to better understand the CCL4 and NATADB lists. In particular, we examine "vapor-pressure", "henrys-law", and "boiling-point" and plot the means by chemical for these using boxplots. We then compare the values by grouping by both data set and `propType` value.
### Vapor Pressure
Begin by grouping pulled data by DTXSID, and also by DTXSID and property type.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
ccl4_vapor_all <- ccl4_phys_chem[propertyId %in% 'vapor-pressure',
                                 .(mean_vapor_pressure = sapply(.SD, mean)),
                                 .SDcols = c('value'), by = .(dtxsid)]
natadb_vapor_all <- natadb_phys_chem[propertyId %in% 'vapor-pressure',
                                     .(mean_vapor_pressure = sapply(.SD, mean)),
                                     .SDcols = c('value'), by = .(dtxsid)]
ccl4_vapor_grouped <- ccl4_phys_chem[propertyId %in% 'vapor-pressure',
                                     .(mean_vapor_pressure = sapply(.SD, mean)),
                                     .SDcols = c('value'),
                                     by = .(dtxsid, propType)]
natadb_vapor_grouped <- natadb_phys_chem[propertyId %in% 'vapor-pressure',
                                         .(mean_vapor_pressure =
                                             sapply(.SD, mean)),
                                         .SDcols = c('value'),
                                         by = .(dtxsid, propType)]
```
Examine summary statistics.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
summary(ccl4_vapor_all)
summary(ccl4_vapor_grouped)
summary(natadb_vapor_all)
summary(natadb_vapor_grouped)
```
Because the values span several orders of magnitude, log transform the data. The data from both chemical lists can also be plotted individually and by property type.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
ccl4_vapor_all[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)]
ccl4_vapor_grouped[, log_transform_mean_vapor_pressure :=
                     log(mean_vapor_pressure)]
natadb_vapor_all[, log_transform_mean_vapor_pressure :=
                   log(mean_vapor_pressure)]
natadb_vapor_grouped[, log_transform_mean_vapor_pressure :=
                       log(mean_vapor_pressure)]
```
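The `log()` call used above is the natural logarithm. The base R sketch below, using made-up values rather than API data, illustrates why the transform helps when values span several orders of magnitude, and shows `log10()` as one possible alternative in which each unit corresponds to one order of magnitude. The choice of base rescales the axis but does not change the ordering or relative spacing of the points.

```r
# Made-up vapor pressures spanning several orders of magnitude (illustration only).
vp <- c(1e-6, 1e-3, 1, 1e3)

range(vp)   # raw scale: dominated by the largest value
log(vp)     # natural log, as used in this vignette
log10(vp)   # base-10 alternative: -6, -3, 0, 3

# The two transforms differ only by the constant factor log(10).
all.equal(log(vp), log10(vp) * log(10))
```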
```{r, fig.align='center', echo=FALSE, eval=FALSE}
ggplot(ccl4_vapor_all, aes(log_transform_mean_vapor_pressure)) +
  geom_boxplot() +
  coord_flip()
ggplot(ccl4_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) +
  geom_boxplot()
```
```{r, fig.align='center', echo=FALSE, eval=FALSE}
ggplot(natadb_vapor_all, aes(log_transform_mean_vapor_pressure)) +
  geom_boxplot() +
  coord_flip()
ggplot(natadb_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) +
  geom_boxplot()
```
Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function `rbind()`.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
ccl4_vapor_grouped[, set := 'CCL4']
natadb_vapor_grouped[, set := 'NATADB']
all_vapor_grouped <- rbind(ccl4_vapor_grouped, natadb_vapor_grouped)
vapor_box <- ggplot(all_vapor_grouped,
                    aes(set, log_transform_mean_vapor_pressure)) +
  geom_boxplot(aes(color = propType))
vapor <- ggplot(all_vapor_grouped, aes(log_transform_mean_vapor_pressure)) +
  geom_boxplot(aes(color = set)) +
  coord_flip()
```
Plot the combined data. Boxplots are colored based on the property type, with mean log transformed vapor pressure plotted for each chemical list and property type, or by chemical list alone.
```{r fig.align='center', class.source="scroll-200", echo=FALSE}
gridExtra::grid.arrange(vapor_box, vapor, ncol=2)
```
In the box plots above, a general trend indicates that the NATADB chemical list has a higher mean vapor pressure than the CCL4 chemical list.
### Henry's Law constant
Henry's Law constant can be explored in a similar fashion. Begin by grouping data by DTXSID, and also by DTXSID and property type.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
ccl4_hlc_all <- ccl4_phys_chem[propertyId %in% 'henrys-law',
                               .(mean_hlc = sapply(.SD, mean)),
                               .SDcols = c('value'), by = .(dtxsid)]
natadb_hlc_all <- natadb_phys_chem[propertyId %in% 'henrys-law',
                                   .(mean_hlc = sapply(.SD, mean)),
                                   .SDcols = c('value'), by = .(dtxsid)]
ccl4_hlc_grouped <- ccl4_phys_chem[propertyId %in% 'henrys-law',
                                   .(mean_hlc = sapply(.SD, mean)),
                                   .SDcols = c('value'),
                                   by = .(dtxsid, propType)]
natadb_hlc_grouped <- natadb_phys_chem[propertyId %in% 'henrys-law',
                                       .(mean_hlc = sapply(.SD, mean)),
                                       .SDcols = c('value'),
                                       by = .(dtxsid, propType)]
```
Examine summary statistics.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
summary(ccl4_hlc_all)
summary(ccl4_hlc_grouped)
summary(natadb_hlc_all)
summary(natadb_hlc_grouped)
```
Again, log transform the data as it is positive and covers several orders of magnitude.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
ccl4_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
ccl4_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
natadb_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
natadb_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
```
Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function `rbind()`.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
ccl4_hlc_grouped[, set := 'CCL4']
natadb_hlc_grouped[, set := 'NATADB']
all_hlc_grouped <- rbind(ccl4_hlc_grouped, natadb_hlc_grouped)
hlc_box <- ggplot(all_hlc_grouped, aes(set, log_transform_mean_hlc)) +
  geom_boxplot(aes(color = propType))
hlc <- ggplot(all_hlc_grouped, aes(log_transform_mean_hlc)) +
  geom_boxplot(aes(color = set)) +
  coord_flip()
```
```{r fig.align='center',class.source="scroll-200", echo=FALSE}
gridExtra::grid.arrange(hlc_box, hlc, ncol=2)
```
Again, both when grouping by `propType` and when aggregating all results together by chemical list, NATADB chemicals generally have a higher mean Henry's Law constant value than CCL4 chemicals.
### Boiling Point
Boiling point data can be explored in a similar fashion. Begin by grouping data by DTXSID, and also by DTXSID and property type.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
ccl4_boiling_all <- ccl4_phys_chem[propertyId %in% 'boiling-point',
                                   .(mean_boiling_point = sapply(.SD, mean)),
                                   .SDcols = c('value'), by = .(dtxsid)]
natadb_boiling_all <- natadb_phys_chem[propertyId %in% 'boiling-point',
                                       .(mean_boiling_point =
                                           sapply(.SD, mean)),
                                       .SDcols = c('value'), by = .(dtxsid)]
ccl4_boiling_grouped <- ccl4_phys_chem[propertyId %in% 'boiling-point',
                                       .(mean_boiling_point =
                                           sapply(.SD, mean)),
                                       .SDcols = c('value'),
                                       by = .(dtxsid, propType)]
natadb_boiling_grouped <- natadb_phys_chem[propertyId %in% 'boiling-point',
                                           .(mean_boiling_point =
                                               sapply(.SD, mean)),
                                           .SDcols = c('value'),
                                           by = .(dtxsid, propType)]
```
Calculate summary statistics.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
summary(ccl4_boiling_all)
summary(ccl4_boiling_grouped)
summary(natadb_boiling_all)
summary(natadb_boiling_grouped)
```
Since some of the boiling point values are negative, log transforming them would produce NaNs and trigger warnings, so the untransformed values are used here.
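The behavior of `log()` on negative inputs can be seen in a small base R sketch with made-up boiling points (not API data). The Kelvin conversion at the end is only one possible workaround, shown as an assumption-laden alternative; the vignette itself plots the untransformed Celsius values.

```r
# Made-up boiling points in degrees Celsius, including a negative value.
bp <- c(-42.1, 56.0, 100.0)

# log() of a negative number is NaN and raises a warning (suppressed here).
log_bp <- suppressWarnings(log(bp))
is.nan(log_bp)

# One possible workaround: convert to Kelvin, which is strictly positive
# for physical temperatures, before transforming.
log(bp + 273.15)
```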
Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function `rbind()`.
```{r fig.align='center',class.source="scroll-300",message=FALSE}
ccl4_boiling_grouped[, set := 'CCL4']
natadb_boiling_grouped[, set := 'NATADB']
all_boiling_grouped <- rbind(ccl4_boiling_grouped, natadb_boiling_grouped)
boiling_box <- ggplot(all_boiling_grouped, aes(set, mean_boiling_point)) +
  geom_boxplot(aes(color = propType))
boiling <- ggplot(all_boiling_grouped, aes(mean_boiling_point)) +
  geom_boxplot(aes(color = set)) +
  coord_flip()
```
```{r fig.align='center',class.source="scroll-200", echo=FALSE}
gridExtra::grid.arrange(boiling_box, boiling, ncol=2)
```
A visual inspection of this set of graphs is not as clear as in the previous cases. Note that the predicted values for each data set tend to be higher than the experimental values. The mean of CCL4, for both predicted and experimental values, appears to be greater than the corresponding means for NATADB, as does the overall mean, but the interquartile ranges of these groupings yield slightly different results. This suggests that the picture for boiling point is not as clear-cut between experimental and predicted for these two chemical lists as it was for the physico-chemical properties investigated previously.
To summarize the observations, across the various physico-chemical properties for chemicals in these chemical lists, there are indeed differences between the mean values of various physico-chemical properties when grouped by predicted or experimental.
- For "vapor-pressure", the means of predicted values tend to be a little lower than experimental, though they are much closer in the case of NATADB than CCL4.
- The trend of lower predicted means compared to experimental means is more clearly demonstrated for "henrys-law" values in both data sets.
- In the case of "boiling-point", the predicted values are greater than the experimental values, though this is much more pronounced in CCL4 while the set of means for NATADB are again fairly close.
# Conclusion
In this vignette, a variety of functions that access different types of data found in the `Chemical` endpoints of the CTX APIs were explored. While this exploration was not exhaustive, it provides a basic introduction to how one may access data and work with it. Additional endpoints and corresponding functions exist and we encourage the user to explore these while keeping in mind the examples contained in this vignette.
```{r breakdown, echo = FALSE, results = 'hide'}
# This chunk will be hidden in the final product. It serves to undo defining the
# custom print function to prevent unexpected behavior after this module during
# the final knitting process and restores original option values.
knit_print.data.table <- knitr::normal_print
registerS3method(
  "knit_print", "data.table", knit_print.data.table,
  envir = asNamespace("knitr")
)
options(old_options)
```
```{r, include=FALSE}
end_vignette()
```