normality() performs the Shapiro-Wilk test of normality on numerical variables.

normality(.data, ...)

# S3 method for data.frame
normality(.data, ..., sample = 5000)

# S3 method for grouped_df
normality(.data, ..., sample = 5000)

Arguments

.data

a data.frame or a tbl_df.

...

one or more unquoted expressions separated by commas. You can treat variable names as if they were positions. Positive values select variables; negative values drop variables. If the first expression is negative, normality() automatically starts with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.

sample

the number of observations to sample for the test. The default is 5000, the largest sample size that stats::shapiro.test supports.

See vignette("EDA") for an introduction to these concepts.

Value

An object of the same class as .data.

Details

This function is useful when used with the group_by() function of the dplyr package. To test each level of a categorical variable of interest rather than the whole set of observations, pass a grouped_df created with group_by() as .data. The test itself is computed with the stats::shapiro.test function.
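As a rough illustration of the idea described above, the following base R sketch applies stats::shapiro.test to every numeric column, first for a whole data set and then once per level of a categorical variable. The helper name normality_sketch and its n_max argument are hypothetical and are not part of dlookr; the built-in iris data stands in for real data. It is a minimal sketch under these assumptions, not the package's implementation.

```r
# Hypothetical helper (not part of dlookr): Shapiro-Wilk test per numeric column.
normality_sketch <- function(df, n_max = 5000) {
  # Keep only the numeric columns
  num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  rows <- lapply(num_cols, function(v) {
    x <- df[[v]]
    x <- x[!is.na(x)]
    # Draw at most n_max observations; shapiro.test accepts 3 to 5000 values
    if (length(x) > n_max) x <- sample(x, n_max)
    res <- stats::shapiro.test(x)
    data.frame(vars = v,
               statistic = unname(res$statistic),
               p_value = res$p.value,
               sample = length(x))
  })
  do.call(rbind, rows)
}

# Whole data set
whole <- normality_sketch(iris)

# One result table per level of a categorical variable
by_species <- lapply(split(iris, iris$Species), normality_sketch)
```

Splitting by a factor and applying the helper to each piece mirrors, in spirit, what passing a grouped_df to normality() does, although the real function returns a single tibble with the grouping columns included.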

Normality test information

The test returns the following information for each numerical variable.

  • statistic : the value of the Shapiro-Wilk statistic.

  • p_value : an approximate p-value for the test. This is said in Royston (1995) to be adequate for p_value < 0.1.

  • sample : the number of observations used in the test. The stats::shapiro.test function supports between 3 and 5000 observations.
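The fields listed above correspond to components of the object returned by stats::shapiro.test. The short sketch below, which uses simulated data chosen only for illustration, shows where the statistic and p-value come from and how the 3 to 5000 limit shows up as an error when it is violated.

```r
set.seed(1)
x <- rnorm(100)

res <- stats::shapiro.test(x)
res$statistic  # the Shapiro-Wilk statistic, a numeric value named "W"
res$p.value    # the approximate p-value

# Sample sizes outside 3 to 5000 raise an error; capture the message instead
too_small <- tryCatch(stats::shapiro.test(c(1, 2)),
                      error = function(e) conditionMessage(e))
too_large <- tryCatch(stats::shapiro.test(rnorm(5001)),
                      error = function(e) conditionMessage(e))
```

This limit is presumably why normality() exposes a sample argument with a default of 5000.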

Examples

# \donttest{
# Normality test of numerical variables
normality(heartfailure)
#> # A tibble: 7 x 4
#>   vars              statistic  p_value sample
#>   <chr>                 <dbl>    <dbl>  <dbl>
#> 1 age                   0.975 5.11e- 5    299
#> 2 cpk_enzyme            0.514 7.05e-28    299
#> 3 ejection_fraction     0.947 7.22e- 9    299
#> 4 platelets             0.912 2.88e-12    299
#> 5 creatinine            0.551 5.39e-27    299
#> 6 sodium                0.939 9.21e-10    299
#> 7 time                  0.947 6.28e- 9    299
# Select the variables to test
normality(heartfailure, platelets, sodium)
#> # A tibble: 2 x 4
#>   vars      statistic  p_value sample
#>   <chr>         <dbl>    <dbl>  <dbl>
#> 1 platelets     0.912 2.88e-12    299
#> 2 sodium        0.939 9.21e-10    299
normality(heartfailure, -platelets, -sodium)
#> # A tibble: 5 x 4
#>   vars              statistic  p_value sample
#>   <chr>                 <dbl>    <dbl>  <dbl>
#> 1 age                   0.975 5.11e- 5    299
#> 2 cpk_enzyme            0.514 7.05e-28    299
#> 3 ejection_fraction     0.947 7.22e- 9    299
#> 4 creatinine            0.551 5.39e-27    299
#> 5 time                  0.947 6.28e- 9    299
normality(heartfailure, 1)
#> # A tibble: 1 x 4
#>   vars  statistic   p_value sample
#>   <chr>     <dbl>     <dbl>  <dbl>
#> 1 age       0.975 0.0000511    299
normality(heartfailure, platelets, sodium, sample = 200)
#> # A tibble: 2 x 4
#>   vars      statistic  p_value sample
#>   <chr>         <dbl>    <dbl>  <dbl>
#> 1 platelets     0.907 6.67e-10    200
#> 2 sodium        0.936 1.03e- 7    200
# Using dplyr::grouped_df
library(dplyr)
gdata <- group_by(heartfailure, smoking, death_event)
normality(gdata, "platelets")
#> # A tibble: 4 x 6
#>   variable  smoking death_event statistic      p_value sample
#>   <chr>     <fct>   <fct>           <dbl>        <dbl>  <dbl>
#> 1 platelets No      No              0.892 0.0000000161    137
#> 2 platelets No      Yes             0.987 0.704            66
#> 3 platelets Yes     No              0.834 0.000000404      66
#> 4 platelets Yes     Yes             0.900 0.00836          30
normality(gdata, sample = 250)
#> # A tibble: 28 x 6
#>    variable          smoking death_event statistic  p_value sample
#>    <chr>             <fct>   <fct>           <dbl>    <dbl>  <dbl>
#>  1 age               No      No              0.978 2.49e- 2    137
#>  2 age               No      Yes             0.955 1.70e- 2     66
#>  3 age               Yes     No              0.972 1.46e- 1     66
#>  4 age               Yes     Yes             0.967 4.61e- 1     30
#>  5 cpk_enzyme        No      No              0.715 5.49e-15    137
#>  6 cpk_enzyme        No      Yes             0.458 3.20e-14     66
#>  7 cpk_enzyme        Yes     No              0.563 1.03e-12     66
#>  8 cpk_enzyme        Yes     Yes             0.391 4.50e-10     30
#>  9 ejection_fraction No      No              0.918 4.26e- 7    137
#> 10 ejection_fraction No      Yes             0.941 3.44e- 3     66
#> # … with 18 more rows
# Using pipes ---------------------------------
# Normality test of all numerical variables
heartfailure %>%
  normality()
#> # A tibble: 7 x 4
#>   vars              statistic  p_value sample
#>   <chr>                 <dbl>    <dbl>  <dbl>
#> 1 age                   0.975 5.11e- 5    299
#> 2 cpk_enzyme            0.514 7.05e-28    299
#> 3 ejection_fraction     0.947 7.22e- 9    299
#> 4 platelets             0.912 2.88e-12    299
#> 5 creatinine            0.551 5.39e-27    299
#> 6 sodium                0.939 9.21e-10    299
#> 7 time                  0.947 6.28e- 9    299
# Positive values select variables
heartfailure %>%
  normality(platelets, sodium)
#> # A tibble: 2 x 4
#>   vars      statistic  p_value sample
#>   <chr>         <dbl>    <dbl>  <dbl>
#> 1 platelets     0.912 2.88e-12    299
#> 2 sodium        0.939 9.21e-10    299
# Positional values select variables
heartfailure %>%
  normality(1)
#> # A tibble: 1 x 4
#>   vars  statistic   p_value sample
#>   <chr>     <dbl>     <dbl>  <dbl>
#> 1 age       0.975 0.0000511    299
# Using pipes & dplyr -------------------------
# Test all numerical variables by 'smoking' and 'death_event',
# and extract only the rows where the 'smoking' level is "No".
heartfailure %>%
  group_by(smoking, death_event) %>%
  normality() %>%
  filter(smoking == "No")
#> # A tibble: 14 x 6
#>    variable          smoking death_event statistic  p_value sample
#>    <chr>             <fct>   <fct>           <dbl>    <dbl>  <dbl>
#>  1 age               No      No              0.978 2.49e- 2    137
#>  2 age               No      Yes             0.955 1.70e- 2     66
#>  3 cpk_enzyme        No      No              0.715 5.49e-15    137
#>  4 cpk_enzyme        No      Yes             0.458 3.20e-14     66
#>  5 ejection_fraction No      No              0.918 4.26e- 7    137
#>  6 ejection_fraction No      Yes             0.941 3.44e- 3     66
#>  7 platelets         No      No              0.892 1.61e- 8    137
#>  8 platelets         No      Yes             0.987 7.04e- 1     66
#>  9 creatinine        No      No              0.550 1.00e-18    137
#> 10 creatinine        No      Yes             0.655 3.52e-11     66
#> 11 sodium            No      No              0.980 4.11e- 2    137
#> 12 sodium            No      Yes             0.948 7.83e- 3     66
#> 13 time              No      No              0.929 2.29e- 6    137
#> 14 time              No      Yes             0.870 5.24e- 6     66
# Extract only the rows where the 'sex' level is "Male",
# and test 'platelets' by 'smoking' and 'death_event'
heartfailure %>%
  filter(sex == "Male") %>%
  group_by(smoking, death_event) %>%
  normality(platelets)
#> # A tibble: 4 x 6
#>   variable  smoking death_event statistic     p_value sample
#>   <chr>     <fct>   <fct>           <dbl>       <dbl>  <dbl>
#> 1 platelets No      No              0.963 0.0429          67
#> 2 platelets No      Yes             0.966 0.349           35
#> 3 platelets Yes     No              0.832 0.000000406     65
#> 4 platelets Yes     Yes             0.963 0.429           27
# Test log(platelets) by 'smoking' and 'death_event',
# and extract only the rows with p_value greater than 0.01.
heartfailure %>%
  mutate(platelets_income = log(platelets)) %>%
  group_by(smoking, death_event) %>%
  normality(platelets_income) %>%
  filter(p_value > 0.01)
#> # A tibble: 1 x 6
#>   variable         smoking death_event statistic p_value sample
#>   <chr>            <fct>   <fct>           <dbl>   <dbl>  <dbl>
#> 1 platelets_income Yes     Yes             0.983   0.907     30
# }