normality() performs the Shapiro-Wilk test of normality on numerical variables.
```r
normality(.data, ...)

# S3 method for data.frame
normality(.data, ..., sample = 5000)

# S3 method for grouped_df
normality(.data, ..., sample = 5000)
```
.data | a data.frame or a |
---|---|
... | one or more unquoted expressions separated by commas. You can treat variable names as if they were positions. Positive values select variables; negative values drop variables. If the first expression is negative, normality() automatically starts with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing. |
sample | the number of observations to sample for the test. |

See vignette("EDA") for an introduction to these concepts.
An object of the same class as .data.
This function is useful when used with the group_by
function of the dplyr package. If you want to test within each level of a
categorical variable of interest, rather than over all observations,
pass a grouped_df created by the group_by function.
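As a rough illustration of what testing within each level means, the base-R sketch below splits a numeric vector by a factor and applies stats::shapiro.test to each piece. It uses the built-in iris data rather than heartfailure, and shapiro_by_group is a hypothetical helper written for this example, not dlookr's implementation.

```r
# Hypothetical helper (not part of dlookr): Shapiro-Wilk test per group.
shapiro_by_group <- function(x, g) {
  # One numeric vector per level of g; unused levels are dropped.
  pieces <- split(x, g, drop = TRUE)
  res <- lapply(pieces, stats::shapiro.test)
  data.frame(
    group     = names(pieces),
    statistic = vapply(res, function(r) unname(r$statistic), numeric(1)),
    p_value   = vapply(res, function(r) r$p.value, numeric(1)),
    sample    = vapply(pieces, length, integer(1)),
    row.names = NULL
  )
}

# One row per species, each tested on its own 50 observations.
by_species <- shapiro_by_group(iris$Sepal.Length, iris$Species)
by_species
```

A small p_value in one group and a large one in another is exactly the kind of difference that a single test over all observations would hide.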
The test is computed with the shapiro.test
function.
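For reference, a direct call to stats::shapiro.test on a single numeric vector looks like the sketch below. The built-in iris data is used here only as a stand-in; the way normality() wraps this call internally is not shown in this page.

```r
# Test one numeric vector directly with the base function.
x <- iris$Sepal.Width
res <- stats::shapiro.test(x)

# The result is an "htest" object; the two fields of interest are:
res$statistic  # the W statistic, a named numeric of length 1
res$p.value    # the approximate p-value
```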
The information derived from the numerical data test is as follows.
statistic : the value of the Shapiro-Wilk statistic.
p_value : an approximate p-value for the test. This is said in Royston (1995) to be adequate for p_value < 0.1.
sample : the number of observations used in the test. The stats::shapiro.test function supports between 3 and 5000 observations.
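The 5000-observation limit is why a sample argument is needed at all. The sketch below shows one way such a cap could work: drop missing values, draw a random subset when the vector is too long, then run the test. shapiro_capped is a hypothetical helper, and drawing without replacement is an assumption; dlookr's own sampling scheme may differ.

```r
# Hypothetical helper (not dlookr's code): cap the sample size before testing.
shapiro_capped <- function(x, sample = 5000) {
  x <- x[!is.na(x)]
  if (length(x) > sample) {
    # Assumption: a simple random subset drawn without replacement.
    x <- x[sample.int(length(x), sample)]
  }
  res <- stats::shapiro.test(x)
  c(statistic = unname(res$statistic),
    p_value   = res$p.value,
    sample    = length(x))
}

set.seed(1)
big <- rnorm(6000)

# stats::shapiro.test(big) would stop with an error because n > 5000;
# the capped version tests a 5000-observation subset instead.
capped <- shapiro_capped(big)
capped
```

Because the subset is random, the statistic and p_value can change from run to run unless a seed is set.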
```r
# \donttest{
# Normality test of numerical variables
normality(heartfailure)
#> # A tibble: 7 x 4
#>   vars              statistic  p_value sample
#>   <chr>                 <dbl>    <dbl>  <dbl>
#> 1 age                   0.975 5.11e- 5    299
#> 2 cpk_enzyme            0.514 7.05e-28    299
#> 3 ejection_fraction     0.947 7.22e- 9    299
#> 4 platelets             0.912 2.88e-12    299
#> 5 creatinine            0.551 5.39e-27    299
#> 6 sodium                0.939 9.21e-10    299
#> 7 time                  0.947 6.28e- 9    299

# Select the variables to test
normality(heartfailure, platelets, sodium)
#> # A tibble: 2 x 4
#>   vars      statistic  p_value sample
#>   <chr>         <dbl>    <dbl>  <dbl>
#> 1 platelets     0.912 2.88e-12    299
#> 2 sodium        0.939 9.21e-10    299

normality(heartfailure, -platelets, -sodium)
#> # A tibble: 5 x 4
#>   vars              statistic  p_value sample
#>   <chr>                 <dbl>    <dbl>  <dbl>
#> 1 age                   0.975 5.11e- 5    299
#> 2 cpk_enzyme            0.514 7.05e-28    299
#> 3 ejection_fraction     0.947 7.22e- 9    299
#> 4 creatinine            0.551 5.39e-27    299
#> 5 time                  0.947 6.28e- 9    299

normality(heartfailure, 1)
#> # A tibble: 1 x 4
#>   vars  statistic   p_value sample
#>   <chr>     <dbl>     <dbl>  <dbl>
#> 1 age       0.975 0.0000511    299

normality(heartfailure, platelets, sodium, sample = 200)
#> # A tibble: 2 x 4
#>   vars      statistic  p_value sample
#>   <chr>         <dbl>    <dbl>  <dbl>
#> 1 platelets     0.907 6.67e-10    200
#> 2 sodium        0.936 1.03e- 7    200

# Using dplyr::grouped_df
library(dplyr)

gdata <- group_by(heartfailure, smoking, death_event)
normality(gdata, "platelets")
#> # A tibble: 4 x 6
#>   variable  smoking death_event statistic      p_value sample
#>   <chr>     <fct>   <fct>           <dbl>        <dbl>  <dbl>
#> 1 platelets No      No              0.892 0.0000000161    137
#> 2 platelets No      Yes             0.987 0.704            66
#> 3 platelets Yes     No              0.834 0.000000404      66
#> 4 platelets Yes     Yes             0.900 0.00836          30

normality(gdata, sample = 250)
#> # A tibble: 28 x 6
#>    variable          smoking death_event statistic  p_value sample
#>    <chr>             <fct>   <fct>           <dbl>    <dbl>  <dbl>
#>  1 age               No      No              0.978 2.49e- 2    137
#>  2 age               No      Yes             0.955 1.70e- 2     66
#>  3 age               Yes     No              0.972 1.46e- 1     66
#>  4 age               Yes     Yes             0.967 4.61e- 1     30
#>  5 cpk_enzyme        No      No              0.715 5.49e-15    137
#>  6 cpk_enzyme        No      Yes             0.458 3.20e-14     66
#>  7 cpk_enzyme        Yes     No              0.563 1.03e-12     66
#>  8 cpk_enzyme        Yes     Yes             0.391 4.50e-10     30
#>  9 ejection_fraction No      No              0.918 4.26e- 7    137
#> 10 ejection_fraction No      Yes             0.941 3.44e- 3     66
#> # … with 18 more rows

# Using pipes ---------------------------------
# Normality test of all numerical variables
heartfailure %>%
  normality()
#> # A tibble: 7 x 4
#>   vars              statistic  p_value sample
#>   <chr>                 <dbl>    <dbl>  <dbl>
#> 1 age                   0.975 5.11e- 5    299
#> 2 cpk_enzyme            0.514 7.05e-28    299
#> 3 ejection_fraction     0.947 7.22e- 9    299
#> 4 platelets             0.912 2.88e-12    299
#> 5 creatinine            0.551 5.39e-27    299
#> 6 sodium                0.939 9.21e-10    299
#> 7 time                  0.947 6.28e- 9    299

# Positive values select variables
heartfailure %>%
  normality(platelets, sodium)
#> # A tibble: 2 x 4
#>   vars      statistic  p_value sample
#>   <chr>         <dbl>    <dbl>  <dbl>
#> 1 platelets     0.912 2.88e-12    299
#> 2 sodium        0.939 9.21e-10    299

# Positional values select variables
heartfailure %>%
  normality(1)
#> # A tibble: 1 x 4
#>   vars  statistic   p_value sample
#>   <chr>     <dbl>     <dbl>  <dbl>
#> 1 age       0.975 0.0000511    299

# Using pipes & dplyr -------------------------
# Test all numerical variables by 'smoking' and 'death_event',
# and keep only the rows where 'smoking' is "No".
heartfailure %>%
  group_by(smoking, death_event) %>%
  normality() %>%
  filter(smoking == "No")
#> # A tibble: 14 x 6
#>    variable          smoking death_event statistic  p_value sample
#>    <chr>             <fct>   <fct>           <dbl>    <dbl>  <dbl>
#>  1 age               No      No              0.978 2.49e- 2    137
#>  2 age               No      Yes             0.955 1.70e- 2     66
#>  3 cpk_enzyme        No      No              0.715 5.49e-15    137
#>  4 cpk_enzyme        No      Yes             0.458 3.20e-14     66
#>  5 ejection_fraction No      No              0.918 4.26e- 7    137
#>  6 ejection_fraction No      Yes             0.941 3.44e- 3     66
#>  7 platelets         No      No              0.892 1.61e- 8    137
#>  8 platelets         No      Yes             0.987 7.04e- 1     66
#>  9 creatinine        No      No              0.550 1.00e-18    137
#> 10 creatinine        No      Yes             0.655 3.52e-11     66
#> 11 sodium            No      No              0.980 4.11e- 2    137
#> 12 sodium            No      Yes             0.948 7.83e- 3     66
#> 13 time              No      No              0.929 2.29e- 6    137
#> 14 time              No      Yes             0.870 5.24e- 6     66

# Keep only the rows where 'sex' is "Male",
# and test 'platelets' by 'smoking' and 'death_event'
heartfailure %>%
  filter(sex == "Male") %>%
  group_by(smoking, death_event) %>%
  normality(platelets)
#> # A tibble: 4 x 6
#>   variable  smoking death_event statistic     p_value sample
#>   <chr>     <fct>   <fct>           <dbl>       <dbl>  <dbl>
#> 1 platelets No      No              0.963 0.0429          67
#> 2 platelets No      Yes             0.966 0.349           35
#> 3 platelets Yes     No              0.832 0.000000406     65
#> 4 platelets Yes     Yes             0.963 0.429           27

# Test log(platelets) by 'smoking' and 'death_event',
# and keep only the rows with p_value greater than 0.01.
heartfailure %>%
  mutate(platelets_income = log(platelets)) %>%
  group_by(smoking, death_event) %>%
  normality(platelets_income) %>%
  filter(p_value > 0.01)
#> # A tibble: 1 x 6
#>   variable         smoking death_event statistic p_value sample
#>   <chr>            <fct>   <fct>           <dbl>   <dbl>  <dbl>
#> 1 platelets_income Yes     Yes             0.983   0.907     30
# }
```