normality() performs the Shapiro-Wilk test of normality on the numerical (INTEGER, NUMBER, etc.) columns of a DBMS table accessed through tbl_dbi.

# S3 method for tbl_dbi
normality(.data, ..., sample = 5000, in_database = FALSE, collect_size = Inf)

Arguments

.data

a tbl_dbi.

...

one or more unquoted expressions separated by commas. You can treat variable names as if they were positions. Positive values select variables; negative values drop variables. If the first expression is negative, normality() automatically starts with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.

sample

the number of observations sampled to perform the test. The default is 5000.

in_database

Specifies whether to perform in-database operations. If TRUE, most operations are performed in the DBMS. If FALSE, the table data is collected into R and processed in memory. in_database = TRUE is not yet supported.

collect_size

an integer. The number of rows to collect from the DBMS into R. Applies only if in_database = FALSE.

See vignette("EDA") for an introduction to these concepts.

Value

An object of the same class as .data.

Details

This function is useful when combined with the group_by() function of the dplyr package. To run the test for each level of a categorical variable of interest rather than on all observations at once, group the table with group_by() before calling normality(). The test itself is computed with the shapiro.test() function.
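To illustrate what testing by group level means, here is a minimal base R sketch. It is not the package's implementation: it uses the built-in iris data set as a stand-in for a collected table, and the result column names are chosen here only to mirror the fields described in this page.

```r
# Sketch only: run shapiro.test() once per level of a grouping variable.
# 'iris' stands in for table data already collected into R.
by_group <- lapply(
  split(iris$Sepal.Length, iris$Species),  # one numeric vector per level
  stats::shapiro.test
)

# Gather the per-group results into one data frame.
result <- data.frame(
  Species   = names(by_group),
  statistic = vapply(by_group, function(r) unname(r$statistic), numeric(1)),
  p_value   = vapply(by_group, function(r) r$p.value, numeric(1)),
  row.names = NULL
)

print(result)
```

Each row of the result corresponds to one group level, which is the shape of output to expect when normality() is called on a grouped table.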

Normality test information

The information derived from the numerical data test is as follows.

  • statistic : the value of the Shapiro-Wilk statistic.

  • p_value : an approximate p-value for the test. This is said in Royston (1995) to be adequate for p_value < 0.1.

  • sample : the number of observations used to perform the test. The stats::shapiro.test function supports 3 to 5000 observations.
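The three fields above can be reproduced approximately with base R alone. The helper below is hypothetical (normality_sketch is not a dlookr function) and only sketches how a sample size cap could interact with the 3 to 5000 limit of stats::shapiro.test; the package may sample differently.

```r
# Hypothetical helper, not part of dlookr: returns the three reported fields.
normality_sketch <- function(x, sample = 5000) {
  x <- x[!is.na(x)]                      # drop missing values before testing
  n <- min(length(x), sample, 5000)      # shapiro.test() accepts at most 5000
  if (n < 3) stop("shapiro.test() needs at least 3 non-missing values")
  x <- x[sample.int(length(x), n)]       # random subset without replacement
  res <- stats::shapiro.test(x)
  data.frame(
    statistic = unname(res$statistic),   # Shapiro-Wilk W
    p_value   = res$p.value,             # approximate p-value
    sample    = n                        # observations actually tested
  )
}

set.seed(123)
out <- normality_sketch(rnorm(6000), sample = 1000)
print(out)
```

Here 6000 values are available but only 1000 are tested, so the sample field reports 1000.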

Examples

# If you have the 'DBI' and 'RSQLite' packages installed, run this code block:
if (FALSE) {
library(dlookr)  # provides normality() and the heartfailure data
library(dplyr)

# Connect to the DBMS
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy heartfailure to the DBMS as a table named TB_HEARTFAILURE
copy_to(con_sqlite, heartfailure, name = "TB_HEARTFAILURE", overwrite = TRUE)

# Using pipes ---------------------------------
# Normality test of all numerical variables
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  normality()

# Positive values select variables; in-memory mode with a collect size of 200
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  normality(platelets, sodium, collect_size = 200)

# Positional values select variables
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  normality(1)

# Using pipes & dplyr -------------------------
# Test all numerical variables by 'smoking' and 'death_event',
# and extract only the rows where 'smoking' is "Yes".
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  group_by(smoking, death_event) %>%
  normality() %>%
  filter(smoking == "Yes")

# Extract only the rows where 'sex' is "Male",
# and test 'sodium' by 'smoking' and 'death_event'
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  filter(sex == "Male") %>%
  group_by(smoking, death_event) %>%
  normality(sodium)

# Test the log(sodium) variable by 'smoking' and 'death_event',
# and extract only the rows with p_value greater than 0.01.

# Load the SQLite extension functions, which provide log()
RSQLite::initExtension(con_sqlite)

con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  mutate(log_sodium = log(sodium)) %>%
  group_by(smoking, death_event) %>%
  normality(log_sodium) %>%
  filter(p_value > 0.01)
 
# Disconnect DBMS   
DBI::dbDisconnect(con_sqlite)
}