% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/correlate.R, R/tbl_dbi.R
\name{correlate}
\alias{correlate}
\alias{correlate.data.frame}
\alias{correlate.grouped_df}
\alias{correlate.tbl_dbi}
\title{Compute the correlation coefficient between two variables}
\usage{
correlate(.data, ...)

\method{correlate}{data.frame}(
  .data,
  ...,
  method = c("pearson", "kendall", "spearman", "cramer", "theil")
)

\method{correlate}{grouped_df}(
  .data,
  ...,
  method = c("pearson", "kendall", "spearman", "cramer", "theil")
)

\method{correlate}{tbl_dbi}(
  .data,
  ...,
  method = c("pearson", "kendall", "spearman", "cramer", "theil"),
  in_database = FALSE,
  collect_size = Inf
)
}
\arguments{
\item{.data}{a data.frame, a \code{\link[dplyr]{grouped_df}}, or a tbl_dbi.}

\item{...}{one or more unquoted expressions separated by commas.
You can treat variable names as if they were positions.
Positive values select variables; negative values drop variables.
If the first expression is negative, correlate() will automatically start with all variables.
These arguments are automatically quoted and evaluated in a context where column names
represent column positions.
They support unquoting and splicing.

See vignette("EDA") for an introduction to these concepts.}

\item{method}{a character string indicating which correlation coefficient (or covariance) is
to be computed.
For numerical variables, one of "pearson" (default), "kendall", or
"spearman"; the name can be abbreviated.
For categorical variables, "cramer" and "theil" can be used. "cramer"
computes Cramer's V statistic, and "theil" computes Theil's U statistic.}

\item{in_database}{Specifies whether to perform in-database operations.
If TRUE, most operations are performed in the DBMS. If FALSE,
table data is pulled into R and processed in memory. in_database = TRUE is not yet supported.}

\item{collect_size}{an integer. The number of data samples to collect from the DBMS into R.
Applies only if in_database = FALSE.}
}
\value{
An object of correlate class.
}
\description{
correlate() computes the correlation coefficient for numerical or categorical data.
}
\details{
This function is useful when combined with the group_by() function of the dplyr package.
If you want to compute the coefficient for each level of a categorical variable of interest,
rather than for all observations, pass a \code{\link[dplyr]{grouped_df}} created by group_by().
For numerical variables, the coefficient is computed by stats::cor() with the use = "pairwise.complete.obs" option.
Categorical variables are supported through Theil's U and Cramer's V correlation coefficients.
}
\section{correlate class}{

The correlate class inherits the tibble class and has the following variables:

\itemize{
\item var1 : name of the numerical variable
\item var2 : name of the corresponding numerical variable
\item coef_corr : Correlation coefficient
}

When method = "cramer", a data.frame with the following variables is returned.
\itemize{
\item var1 : name of the categorical variable
\item var2 : name of the corresponding categorical variable
\item chisq : the value of the chi-squared test statistic
\item df : the degrees of freedom of the approximate chi-squared distribution of the test statistic
\item pval : the p-value for the test
\item coef_corr : Cramer's V correlation coefficient.
}
}

\examples{
\donttest{
# Correlation coefficients of all numerical variables
tab_corr <- correlate(heartfailure)
tab_corr

# Select the variables to compute
correlate(heartfailure, "creatinine", "sodium")

# Non-parametric correlation coefficient by kendall method
correlate(heartfailure, creatinine, method = "kendall")

# theil's U correlation coefficient (Uncertainty Coefficient)
tab_corr <- correlate(heartfailure, anaemia, hblood_pressure, method = "theil")
tab_corr
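
# A rough base-R sketch of how Cramer's V relates to the chi-squared statistic,
# using the textbook formula V = sqrt(chisq / (n * (min(rows, cols) - 1))).
# This is an illustration on the built-in mtcars data, not the package's
# internal code; correlate(method = "cramer") may differ in its details.
# The names tab_toy, chi_toy and cramer_v are hypothetical.
tab_toy <- table(mtcars$cyl, mtcars$gear)
chi_toy <- suppressWarnings(stats::chisq.test(tab_toy, correct = FALSE))
cramer_v <- sqrt(unname(chi_toy$statistic) / (sum(tab_toy) * (min(dim(tab_toy)) - 1)))
cramer_v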
   
# Using dplyr::grouped_df
library(dplyr)

gdata <- group_by(heartfailure, smoking, death_event)
correlate(gdata)

# Using pipes ---------------------------------
# Correlation coefficients of all numerical variables
heartfailure \%>\%
  correlate()
  
# Non-parametric correlation coefficient by spearman method
heartfailure \%>\%
  correlate(creatinine, sodium, method = "spearman")
 
# ---------------------------------------------
# Correlation coefficient
# that eliminates redundant combination of variables
heartfailure \%>\%
  correlate() \%>\%
  filter(as.integer(var1) > as.integer(var2))

# Using pipes & dplyr -------------------------
# Compute the correlation coefficient of the 'creatinine' variable by the 'smoking'
# and 'death_event' variables, then keep only the rows where the absolute
# value of the correlation coefficient is at least 0.2
heartfailure \%>\%
  group_by(smoking, death_event) \%>\%
  correlate(creatinine) \%>\%
  filter(abs(coef_corr) >= 0.2)

# Keep only the rows where the 'smoking' variable level is "Yes",
# and compute the correlation coefficient of the 'creatinine' variable
# by the 'hblood_pressure' and 'death_event' variables.
# Then keep only negative coefficients whose absolute value is greater than 0.5
heartfailure \%>\%
  filter(smoking == "Yes") \%>\%
  group_by(hblood_pressure, death_event) \%>\%
  correlate(creatinine) \%>\%
  filter(coef_corr < 0) \%>\%
  filter(abs(coef_corr) > 0.5)
}
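
# A minimal base-R sketch of the pairwise-complete handling described in Details.
# It assumes stats::cor() is called with use = "pairwise.complete.obs"; it is not
# the package's internal code, and the 'toy' data frame is hypothetical.
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, NA, 10),
  z = c(5, 3, 4, 1, 2)
)

# Each pair of columns uses only the rows where both values are observed,
# so the x-y coefficient is based on rows 1 to 3 only
toy_corr <- stats::cor(toy, use = "pairwise.complete.obs", method = "pearson")
toy_corr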

# If you have the 'DBI' and 'RSQLite' packages installed, run this code block:
if (FALSE) {
library(dplyr)
# connect DBMS
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# copy heartfailure to the DBMS with a table named TB_HEARTFAILURE
copy_to(con_sqlite, heartfailure, name = "TB_HEARTFAILURE", overwrite = TRUE)

# Using pipes ---------------------------------
# Correlation coefficients of all numerical variables
con_sqlite \%>\% 
  tbl("TB_HEARTFAILURE") \%>\% 
  correlate()

# Using pipes & dplyr -------------------------
# Compute the correlation coefficient of creatinine variable by 'hblood_pressure'
# and 'death_event' variables.
con_sqlite \%>\% 
  tbl("TB_HEARTFAILURE") \%>\% 
  group_by(hblood_pressure, death_event) \%>\%
  correlate(creatinine) 

# Disconnect DBMS   
DBI::dbDisconnect(con_sqlite)
}
  
}
\seealso{
\code{\link{cor}}, \code{\link{summary.correlate}}, \code{\link{plot.correlate}}.
}
