Compute the correlation coefficient between two variables

correlate() computes the correlation coefficient for numerical or categorical data.

correlate(.data, ...)

# S3 method for data.frame
correlate(
  .data,
  ...,
  method = c("pearson", "kendall", "spearman", "cramer", "theil")
)

# S3 method for grouped_df
correlate(
  .data,
  ...,
  method = c("pearson", "kendall", "spearman", "cramer", "theil")
)

# S3 method for tbl_dbi
correlate(
  .data,
  ...,
  method = c("pearson", "kendall", "spearman", "cramer", "theil"),
  in_database = FALSE,
  collect_size = Inf
)

Arguments

.data

a data.frame, a grouped_df, or a tbl_dbi.

...

one or more unquoted expressions separated by commas. You can treat variable names like they are positions. Positive values select variables; negative values drop variables. If the first expression is negative, correlate() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.

See vignette("EDA") for an introduction to these concepts.

method

a character string indicating which correlation coefficient is to be computed; it can be abbreviated. For numerical variables, one of "pearson" (default), "kendall", or "spearman". For categorical variables, "cramer" or "theil": "cramer" computes Cramer's V statistic and "theil" computes Theil's U statistic.

in_database

Specifies whether to perform in-database operations. If TRUE, most operations are performed in the DBMS. If FALSE, table data is pulled into R and processed in memory. in_database = TRUE is not yet supported.

collect_size

an integer. The number of data samples to bring from the DBMS into R. Applies only if in_database = FALSE.

Value

An object of correlate class.

Details

This function is useful when used with the group_by() function of the dplyr package. To compute correlations within each level of a categorical variable rather than over all observations, pass a grouped_df created with group_by(). For numerical variables, the coefficient is computed with stats::cor() using the use = "pairwise.complete.obs" option. Categorical variables are supported through Theil's U and Cramer's V correlation coefficients.
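As a rough illustration of the pairwise-complete behaviour mentioned above, the following base R sketch calls stats::cor() directly on a small made-up data frame (the data and the names df, cor_mat, manual_xy, and long are hypothetical). It is not the internal code of correlate(); it only shows how each pair of columns uses the rows where both values are present, and one way the resulting matrix could be reshaped into a long var1/var2/coef_corr layout.

```r
# Toy data with missing values (hypothetical, not taken from heartfailure)
df <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, NA, 10),
  z = c(5, 3, 4, 1, 2)
)

# Each pair of columns uses only the rows where both values are present
cor_mat <- stats::cor(df, use = "pairwise.complete.obs", method = "pearson")

# The x-y coefficient matches a manual computation on the shared complete rows
ok <- stats::complete.cases(df$x, df$y)
manual_xy <- stats::cor(df$x[ok], df$y[ok])

# Reshape the matrix into long form and drop the diagonal (a variable with itself)
long <- as.data.frame(as.table(cor_mat), responseName = "coef_corr")
names(long)[1:2] <- c("var1", "var2")
long <- long[as.character(long$var1) != as.character(long$var2), ]
```

With three numerical columns, the long table keeps both orderings of every pair, which is why the example outputs on this page list each pair twice.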

correlate class

The correlate class inherits from the tibble class and has the following variables:

var1 : name of the first numerical variable
var2 : name of the numerical variable paired with var1
coef_corr : correlation coefficient

When method = "cramer", a data.frame with the following variables is returned.

var1 : name of the first categorical variable
var2 : name of the categorical variable paired with var1
chisq : the value of the chi-squared test statistic
df : the degrees of freedom of the approximate chi-squared distribution of the test statistic
pval : the p-value for the test
coef_corr : Cramer's V correlation coefficient
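The chi-squared statistic, degrees of freedom, and p-value listed for method = "cramer" are the usual ingredients of Cramer's V. The sketch below shows the textbook calculation in base R on a made-up 2 x 2 table; the table, the variable names, and the choice of correct = FALSE are assumptions for illustration, and the package's own computation may differ in detail.

```r
# Hypothetical contingency table of two categorical variables
tab <- matrix(
  c(30, 10, 20, 40),
  nrow = 2,
  dimnames = list(a = c("No", "Yes"), b = c("No", "Yes"))
)

# Pearson chi-squared test without continuity correction (an assumption here)
test <- stats::chisq.test(tab, correct = FALSE)

n <- sum(tab)            # total number of observations
k <- min(dim(tab)) - 1   # smaller table dimension minus one

chisq <- unname(test$statistic)  # chi-squared test statistic
df <- unname(test$parameter)     # degrees of freedom
pval <- test$p.value             # p-value of the test

# Textbook Cramer's V: sqrt(chisq / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(chisq / (n * k))
```

The coefficient ranges from 0 (no association) to 1 (perfect association) and, unlike Theil's U, is symmetric in the two variables.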

Examples

# \donttest{
# Correlation coefficients of all numerical variables
tab_corr <- correlate(heartfailure)
tab_corr
#> # A tibble: 42 × 3
#>    var1              var2       coef_corr
#>    <fct>             <fct>          <dbl>
#>  1 cpk_enzyme        age          -0.0814
#>  2 ejection_fraction age           0.0602
#>  3 platelets         age          -0.0525
#>  4 creatinine        age           0.159 
#>  5 sodium            age          -0.0459
#>  6 time              age          -0.224 
#>  7 age               cpk_enzyme   -0.0814
#>  8 ejection_fraction cpk_enzyme   -0.0441
#>  9 platelets         cpk_enzyme    0.0245
#> 10 creatinine        cpk_enzyme   -0.0164
#> # ℹ 32 more rows

# Select the variables to compute
correlate(heartfailure, "creatinine", "sodium")
#> # A tibble: 12 × 3
#>    var1       var2              coef_corr
#>    <fct>      <fct>                 <dbl>
#>  1 creatinine age                  0.159 
#>  2 sodium     age                 -0.0459
#>  3 creatinine cpk_enzyme          -0.0164
#>  4 sodium     cpk_enzyme           0.0596
#>  5 creatinine ejection_fraction   -0.0113
#>  6 sodium     ejection_fraction    0.176 
#>  7 creatinine platelets           -0.0412
#>  8 sodium     platelets            0.0621
#>  9 sodium     creatinine          -0.189 
#> 10 creatinine sodium              -0.189 
#> 11 creatinine time                -0.149 
#> 12 sodium     time                 0.0876

# Non-parametric correlation coefficient by the Kendall method
correlate(heartfailure, creatinine, method = "kendall")
#> # A tibble: 6 × 3
#>   var1       var2              coef_corr
#>   <fct>      <fct>                 <dbl>
#> 1 creatinine age                  0.191 
#> 2 creatinine cpk_enzyme          -0.0351
#> 3 creatinine ejection_fraction   -0.130 
#> 4 creatinine platelets           -0.0357
#> 5 creatinine sodium              -0.223 
#> 6 creatinine time                -0.110 

# Theil's U correlation coefficient (Uncertainty Coefficient)
tab_corr <- correlate(heartfailure, anaemia, hblood_pressure, method = "theil")
tab_corr
#>               var1            var2    coef_corr
#> 1  hblood_pressure         anaemia 0.0010925763
#> 2          anaemia        diabetes 0.0001188941
#> 3  hblood_pressure        diabetes 0.0001222019
#> 4          anaemia hblood_pressure 0.0010925763
#> 5          anaemia             sex 0.0067211188
#> 6  hblood_pressure             sex 0.0083558270
#> 7          anaemia         smoking 0.0088747588
#> 8  hblood_pressure         smoking 0.0024572165
#> 9          anaemia     death_event 0.0033373058
#> 10 hblood_pressure     death_event 0.0048836271
   
# Using dplyr::grouped_df
library(dplyr)

gdata <- group_by(heartfailure, smoking, death_event)
correlate(gdata)
#> # A tibble: 168 × 5
#>    smoking death_event var1              var2       coef_corr
#>    <fct>   <fct>       <fct>             <fct>          <dbl>
#>  1 No      No          cpk_enzyme        age          -0.0393
#>  2 No      No          ejection_fraction age           0.0749
#>  3 No      No          platelets         age          -0.0579
#>  4 No      No          creatinine        age           0.199 
#>  5 No      No          sodium            age          -0.0427
#>  6 No      No          time              age          -0.0193
#>  7 No      No          age               cpk_enzyme   -0.0393
#>  8 No      No          ejection_fraction cpk_enzyme   -0.0819
#>  9 No      No          platelets         cpk_enzyme    0.0610
#> 10 No      No          creatinine        cpk_enzyme   -0.0339
#> # ℹ 158 more rows

# Using pipes ---------------------------------
# Correlation coefficients of all numerical variables
heartfailure %>%
  correlate()
#> # A tibble: 42 × 3
#>    var1              var2       coef_corr
#>    <fct>             <fct>          <dbl>
#>  1 cpk_enzyme        age          -0.0814
#>  2 ejection_fraction age           0.0602
#>  3 platelets         age          -0.0525
#>  4 creatinine        age           0.159 
#>  5 sodium            age          -0.0459
#>  6 time              age          -0.224 
#>  7 age               cpk_enzyme   -0.0814
#>  8 ejection_fraction cpk_enzyme   -0.0441
#>  9 platelets         cpk_enzyme    0.0245
#> 10 creatinine        cpk_enzyme   -0.0164
#> # ℹ 32 more rows
  
# Non-parametric correlation coefficient by the Spearman method
heartfailure %>%
  correlate(creatinine, sodium, method = "spearman")
#> # A tibble: 12 × 3
#>    var1       var2              coef_corr
#>    <fct>      <fct>                 <dbl>
#>  1 creatinine age                  0.271 
#>  2 sodium     age                 -0.101 
#>  3 creatinine cpk_enzyme          -0.0499
#>  4 sodium     cpk_enzyme           0.0169
#>  5 creatinine ejection_fraction   -0.178 
#>  6 sodium     ejection_fraction    0.162 
#>  7 creatinine platelets           -0.0510
#>  8 sodium     platelets            0.0495
#>  9 sodium     creatinine          -0.300 
#> 10 creatinine sodium              -0.300 
#> 11 creatinine time                -0.161 
#> 12 sodium     time                 0.0864
 
# ---------------------------------------------
# Correlation coefficients
# with redundant combinations of variables removed
heartfailure %>%
  correlate() %>%
  filter(as.integer(var1) > as.integer(var2))
#> # A tibble: 21 × 3
#>    var1              var2       coef_corr
#>    <fct>             <fct>          <dbl>
#>  1 cpk_enzyme        age          -0.0814
#>  2 ejection_fraction age           0.0602
#>  3 platelets         age          -0.0525
#>  4 creatinine        age           0.159 
#>  5 sodium            age          -0.0459
#>  6 time              age          -0.224 
#>  7 ejection_fraction cpk_enzyme   -0.0441
#>  8 platelets         cpk_enzyme    0.0245
#>  9 creatinine        cpk_enzyme   -0.0164
#> 10 sodium            cpk_enzyme    0.0596
#> # ℹ 11 more rows

# Using pipes & dplyr -------------------------
# Compute the correlation coefficient of the 'creatinine' variable by the 'smoking'
# and 'death_event' variables, and extract only the rows where the absolute
# value of the correlation coefficient is at least 0.2
heartfailure %>%
  group_by(smoking, death_event) %>%
  correlate(creatinine) %>%
  filter(abs(coef_corr) >= 0.2)
#> # A tibble: 7 × 5
#>   smoking death_event var1       var2              coef_corr
#>   <fct>   <fct>       <fct>      <fct>                 <dbl>
#> 1 No      Yes         creatinine ejection_fraction     0.298
#> 2 Yes     No          creatinine ejection_fraction    -0.201
#> 3 Yes     No          creatinine sodium               -0.290
#> 4 Yes     No          creatinine time                  0.246
#> 5 Yes     Yes         creatinine age                   0.255
#> 6 Yes     Yes         creatinine sodium               -0.286
#> 7 Yes     Yes         creatinine time                 -0.201

# Extract only the rows where the 'smoking' variable level is "Yes",
# and compute the correlation coefficient of the 'creatinine' variable
# by the 'hblood_pressure' and 'death_event' variables.
# Keep only negative coefficients whose absolute value is greater than 0.5
heartfailure %>%
  filter(smoking == "Yes") %>%
  group_by(hblood_pressure, death_event) %>%
  correlate(creatinine) %>%
  filter(coef_corr < 0) %>%
  filter(abs(coef_corr) > 0.5)
#> # A tibble: 1 × 5
#>   hblood_pressure death_event var1       var2   coef_corr
#>   <fct>           <fct>       <fct>      <fct>      <dbl>
#> 1 Yes             Yes         creatinine sodium    -0.561
# }

# If you have the 'DBI' and 'RSQLite' packages installed, run the code block:
if (FALSE) {
library(dplyr)
# connect DBMS
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# copy heartfailure to the DBMS with a table named TB_HEARTFAILURE
copy_to(con_sqlite, heartfailure, name = "TB_HEARTFAILURE", overwrite = TRUE)

# Using pipes ---------------------------------
# Correlation coefficients of all numerical variables
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  correlate()

# Using pipes & dplyr -------------------------
# Compute the correlation coefficient of the 'creatinine' variable by the
# 'hblood_pressure' and 'death_event' variables.
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  group_by(hblood_pressure, death_event) %>%
  correlate(creatinine) 

# Disconnect DBMS   
DBI::dbDisconnect(con_sqlite)
}