normality() performs the Shapiro-Wilk test of normality on the numerical (INTEGER, NUMBER, etc.) columns of a DBMS table through tbl_dbi.

# S3 method for tbl_dbi
normality(.data, ..., sample = 5000, in_database = FALSE, collect_size = Inf)

Arguments

.data

a tbl_dbi.

...

one or more unquoted expressions separated by commas. You can treat variable names as if they were positions. Positive values select variables; negative values drop variables. If the first expression is negative, normality() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.

sample

the number of samples used to perform the test.

in_database

Specifies whether to perform in-database operations. If TRUE, most operations are performed in the DBMS. If FALSE, table data is collected into R and processed in memory. in_database = TRUE is not yet supported.

collect_size

an integer. The number of rows to collect from the DBMS into R. Applies only if in_database = FALSE.

See vignette("EDA") for an introduction to these concepts.
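
The in-memory mode described above (in_database = FALSE with a collect_size limit) can be illustrated with a rough base R sketch. This is not the package's implementation: normality_in_memory is a hypothetical helper, a plain data frame (iris) stands in for the DBMS table, and taking the first collect_size rows is an assumption about how rows are collected.

```r
# Hypothetical stand-in for the in-memory path (assumption, not package code):
# keep at most `collect_size` rows, then test every numeric column.
normality_in_memory <- function(data, collect_size = Inf) {
  n <- min(nrow(data), collect_size)
  local_data <- data[seq_len(n), , drop = FALSE]

  # Only numeric columns can be tested
  is_num <- vapply(local_data, is.numeric, logical(1))
  numeric_cols <- names(local_data)[is_num]

  rows <- lapply(numeric_cols, function(col) {
    res <- stats::shapiro.test(local_data[[col]])
    data.frame(
      vars      = col,
      statistic = unname(res$statistic),
      p_value   = res$p.value,
      sample    = n  # assumes the column has no missing values
    )
  })
  do.call(rbind, rows)
}

# Test the numeric columns of the first 100 rows of iris
normality_in_memory(iris, collect_size = 100)
```

With collect_size = Inf, the sketch simply uses every row of the data frame.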

Value

An object of the same class as .data.

Details

This function is useful when used with the group_by function of the dplyr package. If you want to test by level of a categorical variable of interest rather than over all observations, group the table with group_by before calling normality(). The test itself is computed with the shapiro.test function.
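
As a minimal sketch of what testing by group level means, the base R snippet below splits one numeric column by a categorical column and runs stats::shapiro.test on each piece. The helper shapiro_by_group and the use of the built-in iris data are illustrative assumptions, not part of the package.

```r
# Hypothetical helper (assumption, not package code): run the Shapiro-Wilk
# test on one numeric column within each level of a grouping column.
shapiro_by_group <- function(data, value, group) {
  pieces <- split(data[[value]], data[[group]])

  rows <- lapply(names(pieces), function(level) {
    x <- pieces[[level]]
    x <- x[!is.na(x)]  # shapiro.test() also drops missing values
    res <- stats::shapiro.test(x)
    data.frame(
      variable  = value,
      group     = level,
      statistic = unname(res$statistic),
      p_value   = res$p.value,
      sample    = length(x)
    )
  })
  do.call(rbind, rows)
}

# One row per Species level
shapiro_by_group(iris, "Sepal.Length", "Species")
```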

Normality test information

The information derived from the numerical data test is as follows.

  • statistic : the value of the Shapiro-Wilk statistic.

  • p_value : an approximate p-value for the test. This is said in Royston (1995) to be adequate for p_value < 0.1.

  • sample : the number of samples used to perform the test. The number of observations supported by the stats::shapiro.test function is 3 to 5000.
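
The three fields above map directly onto the result of stats::shapiro.test. The sketch below is a hedged illustration: shapiro_summary is a hypothetical helper, and drawing a random subsample when the data exceed the limit is an assumption about how the sample argument might be applied, not a statement of the package's behavior.

```r
# Hypothetical helper (assumption, not package code): run shapiro.test() on
# at most `sample` non-missing values and return the three reported fields.
shapiro_summary <- function(x, sample = 5000) {
  x <- x[!is.na(x)]
  n <- min(length(x), sample)

  # Assumed strategy: random subsample without replacement when x is too long
  if (n < length(x)) {
    x <- x[sample.int(length(x), n)]
  }

  # stats::shapiro.test() accepts between 3 and 5000 observations
  stopifnot(n >= 3, n <= 5000)

  res <- stats::shapiro.test(x)
  data.frame(
    statistic = unname(res$statistic),  # Shapiro-Wilk W
    p_value   = res$p.value,
    sample    = n
  )
}

set.seed(1)
shapiro_summary(mtcars$mpg)
```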

Examples

# \donttest{
library(dplyr)

# connect DBMS
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# copy heartfailure to the DBMS with a table named TB_HEARTFAILURE
copy_to(con_sqlite, heartfailure, name = "TB_HEARTFAILURE", overwrite = TRUE)

# Using pipes ---------------------------------
# Normality test of all numerical variables
con_sqlite %>%
  tbl("TB_HEARTFAILURE") %>%
  normality()
#> # A tibble: 7 x 4
#>   vars              statistic  p_value sample
#>   <chr>                 <dbl>    <dbl>  <dbl>
#> 1 age                   0.975 5.11e- 5    299
#> 2 cpk_enzyme            0.514 7.05e-28    299
#> 3 ejection_fraction     0.947 7.22e- 9    299
#> 4 platelets             0.912 2.88e-12    299
#> 5 creatinine            0.551 5.39e-27    299
#> 6 sodium                0.939 9.21e-10    299
#> 7 time                  0.947 6.28e- 9    299

# Positive values select variables; in-memory mode with a collect size of 200
con_sqlite %>%
  tbl("TB_HEARTFAILURE") %>%
  normality(platelets, sodium, collect_size = 200)
#> # A tibble: 2 x 4
#>   vars      statistic  p_value sample
#>   <chr>         <dbl>    <dbl>  <dbl>
#> 1 platelets     0.896 1.43e-10    200
#> 2 sodium        0.915 2.41e- 9    200

# Positional values select variables
con_sqlite %>%
  tbl("TB_HEARTFAILURE") %>%
  normality(1)
#> # A tibble: 1 x 4
#>   vars  statistic   p_value sample
#>   <chr>     <dbl>     <dbl>  <dbl>
#> 1 age       0.975 0.0000511    299

# Using pipes & dplyr -------------------------
# Test all numerical variables by 'smoking' and 'death_event',
# and extract only those where the 'smoking' level is "Yes".
con_sqlite %>%
  tbl("TB_HEARTFAILURE") %>%
  group_by(smoking, death_event) %>%
  normality() %>%
  filter(smoking == "Yes")
#> # A tibble: 14 x 6
#>    variable          smoking death_event statistic  p_value sample
#>    <chr>             <chr>   <chr>           <dbl>    <dbl>  <dbl>
#>  1 age               Yes     No              0.972 1.46e- 1     66
#>  2 age               Yes     Yes             0.967 4.61e- 1     30
#>  3 cpk_enzyme        Yes     No              0.563 1.03e-12     66
#>  4 cpk_enzyme        Yes     Yes             0.391 4.50e-10     30
#>  5 ejection_fraction Yes     No              0.914 2.16e- 4     66
#>  6 ejection_fraction Yes     Yes             0.868 1.49e- 3     30
#>  7 platelets         Yes     No              0.834 4.04e- 7     66
#>  8 platelets         Yes     Yes             0.900 8.36e- 3     30
#>  9 creatinine        Yes     No              0.675 8.20e-11     66
#> 10 creatinine        Yes     Yes             0.513 7.43e- 9     30
#> 11 sodium            Yes     No              0.817 1.30e- 7     66
#> 12 sodium            Yes     Yes             0.978 7.82e- 1     30
#> 13 time              Yes     No              0.943 4.68e- 3     66
#> 14 time              Yes     Yes             0.832 2.65e- 4     30

# Extract only the rows where the 'sex' level is "Male",
# and test 'sodium' by 'smoking' and 'death_event'
con_sqlite %>%
  tbl("TB_HEARTFAILURE") %>%
  filter(sex == "Male") %>%
  group_by(smoking, death_event) %>%
  normality(sodium)
#> # A tibble: 4 x 6
#>   variable smoking death_event statistic     p_value sample
#>   <chr>    <chr>   <chr>           <dbl>       <dbl>  <dbl>
#> 1 sodium   No      No              0.975 0.188           67
#> 2 sodium   No      Yes             0.948 0.0957          35
#> 3 sodium   Yes     No              0.817 0.000000157     65
#> 4 sodium   Yes     Yes             0.980 0.862           27

# Test the log(sodium) variable by 'smoking' and 'death_event',
# and extract only rows with p_value greater than 0.01.

# SQLite extension functions for log
RSQLite::initExtension(con_sqlite)

con_sqlite %>%
  tbl("TB_HEARTFAILURE") %>%
  mutate(log_sodium = log(sodium)) %>%
  group_by(smoking, death_event) %>%
  normality(log_sodium) %>%
  filter(p_value > 0.01)
#> # A tibble: 2 x 6
#>   variable   smoking death_event statistic p_value sample
#>   <chr>      <chr>   <chr>           <dbl>   <dbl>  <dbl>
#> 1 log_sodium No      No              0.977  0.0190    137
#> 2 log_sodium Yes     Yes             0.978  0.781      30

# Disconnect DBMS
DBI::dbDisconnect(con_sqlite)
# }