diagnose_sparese() checks the categorical variables of a data set for sparse level combinations, that is, combinations of levels that are possible but do not appear in the data.

diagnose_sparese(.data, ...)

# S3 method for data.frame
diagnose_sparese(
  .data,
  ...,
  type = c("all", "sparse")[2],
  add_character = FALSE,
  limit = 500
)

Arguments

.data

a data.frame or a tbl_df.

...

one or more unquoted expressions separated by commas. Variable names can be treated as if they were positions: positive values select variables and negative values drop them. If the first expression is negative, diagnose_sparese() automatically starts with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.

type

a character string specifying how the result is extracted. "all" returns every possible combination of levels, together with the frequency of each combination. "sparse" (the default) returns only the sparse level combinations.

add_character

logical. Whether to include character variables in the diagnosis of categorical data. The default is FALSE, which excludes character variables; set it to TRUE to include them.

limit

integer. Upper bound on the number of level combinations to check. If the number of all possible combinations exceeds limit, the calculation stops, a message is printed, and NULL is returned. The default is 500.
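
As a rough illustration of the limit check, the number of possible combinations can be thought of as the product of the level counts of the categorical columns (six two-level factors give 2^6 = 64, which matches the heartfailure message in the Examples). The following base R sketch only mirrors that arithmetic; the toy data frame `df` and the names `n_levels`, `n_comb` and `limit` are hypothetical, and this is not the package's internal code.

```r
# Hypothetical toy data: three categorical (factor) columns.
df <- data.frame(
  sex     = factor(c("F", "M", "F", "M")),
  smoking = factor(c("No", "No", "Yes", "No")),
  grade   = factor(c("A", "B", "C", "A"))
)

# Keep only factor columns and count the levels of each one.
is_cat   <- vapply(df, is.factor, logical(1))
n_levels <- vapply(df[is_cat], nlevels, integer(1))

# Number of all possible level combinations: 2 * 2 * 3.
n_comb <- prod(n_levels)

# Assumed threshold, playing the role of the `limit` argument.
limit <- 10
if (n_comb > limit) {
  message("All possible combinations exceed ", limit,
          ". (Number of combinations: ", n_comb, ")")
}
```

Because the count grows multiplicatively, selecting a few variables through `...` is usually a more practical way to stay under the threshold than raising limit.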

Value

an object of class data.frame.

Information of sparse levels

The information derived from the sparse levels diagnosis is as follows.

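  • variables : levels of the categorical variables, one column per variable.

  • N : number of observations (optional). In the examples below it appears as the n_case column when type = "all".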
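
To make the two result shapes concrete, the sketch below reproduces the idea in base R: cross-tabulate the categorical columns to get a count for every level combination, then keep the combinations whose count is zero. It is a minimal illustration under stated assumptions, not the package's implementation; the toy data frame `df` is hypothetical, and the count column is named `n_case` only to match the example output.

```r
# Hypothetical toy data with two categorical (factor) columns.
df <- data.frame(
  sex     = factor(c("F", "M", "F", "M")),
  smoking = factor(c("No", "No", "Yes", "No"))
)

# Every level combination with its frequency, similar in spirit to type = "all".
all_comb <- as.data.frame(table(df), responseName = "n_case")

# Combinations that never occur, similar in spirit to type = "sparse".
sparse <- all_comb[all_comb$n_case == 0, names(df), drop = FALSE]

print(all_comb)
print(sparse)
```

In this toy data the only combination without observations is a male smoker, so `sparse` holds a single row.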

Examples

# \donttest{
library(dplyr)

# Examples of too many combinations
diagnose_sparese(jobchange)
#> All possible combinations of categorical variables exceed 500. (Number of combinations: 841,674,240)
#> NULL

# Character type is also included in the combination variable
diagnose_sparese(jobchange, add_character = TRUE)
#> All possible combinations of categorical variables exceed 500. (Number of combinations: 1.61248e+13)
#> NULL

# Combination of two variables
jobchange %>%
  diagnose_sparese(education_level, major_discipline)
#> # A tibble: 13 x 2
#>    education_level major_discipline
#>    <fct>           <fct>
#>  1 Primary School  Arts
#>  2 High School     Arts
#>  3 Primary School  Business Degree
#>  4 High School     Business Degree
#>  5 Primary School  Humanities
#>  6 High School     Humanities
#>  7 Primary School  No Major
#>  8 High School     No Major
#>  9 Phd             No Major
#> 10 Primary School  Other
#> 11 High School     Other
#> 12 Primary School  STEM
#> 13 High School     STEM

# Remove two categorical variables from the combination
jobchange %>%
  diagnose_sparese(-city, -education_level)
#> All possible combinations of categorical variables exceed 500. (Number of combinations: 1,368,576)
#> NULL
diagnose_sparese(heartfailure)
#> # A tibble: 14 x 6
#>    anaemia diabetes hblood_pressure sex    smoking death_event
#>    <fct>   <fct>    <fct>           <fct>  <fct>   <fct>
#>  1 No      No       No              Female Yes     No
#>  2 Yes     No       No              Female Yes     No
#>  3 No      Yes      No              Female Yes     No
#>  4 Yes     Yes      No              Female Yes     No
#>  5 Yes     No       Yes             Female Yes     No
#>  6 No      Yes      Yes             Female Yes     No
#>  7 Yes     Yes      Yes             Female Yes     No
#>  8 Yes     Yes      Yes             Male   No      Yes
#>  9 No      No       No              Female Yes     Yes
#> 10 Yes     No       No              Female Yes     Yes
#> 11 No      Yes      No              Female Yes     Yes
#> 12 No      No       Yes             Female Yes     Yes
#> 13 Yes     Yes      Yes             Female Yes     Yes
#> 14 Yes     Yes      Yes             Male   Yes     Yes

# Adjust the limit threshold for the calculation
diagnose_sparese(heartfailure, limit = 50)
#> All possible combinations of categorical variables exceed 50. (Number of combinations: 64)
#> NULL

# List all combinations, including sparse cases
diagnose_sparese(heartfailure, type = "all")
#>    anaemia diabetes hblood_pressure    sex smoking death_event n_case
#> 1       No       No              No Female      No          No     10
#> 2      Yes       No              No Female      No          No     11
#> 3       No      Yes              No Female      No          No     14
#> 4      Yes      Yes              No Female      No          No      9
#> 5       No       No             Yes Female      No          No      8
#> 6      Yes       No             Yes Female      No          No      6
#> 7       No      Yes             Yes Female      No          No      6
#> 8      Yes      Yes             Yes Female      No          No      6
#> 9       No       No              No   Male      No          No     12
#> 10     Yes       No              No   Male      No          No      7
#> 11      No      Yes              No   Male      No          No     13
#> 12     Yes      Yes              No   Male      No          No     11
#> 13      No       No             Yes   Male      No          No      8
#> 14     Yes       No             Yes   Male      No          No      8
#> 15      No      Yes             Yes   Male      No          No      5
#> 16     Yes      Yes             Yes   Male      No          No      3
#> 17      No       No              No Female     Yes          No      0
#> 18     Yes       No              No Female     Yes          No      0
#> 19      No      Yes              No Female     Yes          No      0
#> 20     Yes      Yes              No Female     Yes          No      0
#> 21      No       No             Yes Female     Yes          No      1
#> 22     Yes       No             Yes Female     Yes          No      0
#> 23      No      Yes             Yes Female     Yes          No      0
#> 24     Yes      Yes             Yes Female     Yes          No      0
#> 25      No       No              No   Male     Yes          No     26
#> 26     Yes       No              No   Male     Yes          No     12
#> 27      No      Yes              No   Male     Yes          No      8
#> 28     Yes      Yes              No   Male     Yes          No      4
#> 29      No       No             Yes   Male     Yes          No      5
#> 30     Yes       No             Yes   Male     Yes          No      4
#> 31      No      Yes             Yes   Male     Yes          No      4
#> 32     Yes      Yes             Yes   Male     Yes          No      2
#> 33      No       No              No Female      No         Yes      5
#> 34     Yes       No              No Female      No         Yes      2
#> 35      No      Yes              No Female      No         Yes      3
#> 36     Yes      Yes              No Female      No         Yes      6
#> 37      No       No             Yes Female      No         Yes      2
#> 38     Yes       No             Yes Female      No         Yes      4
#> 39      No      Yes             Yes Female      No         Yes      3
#> 40     Yes      Yes             Yes Female      No         Yes      6
#> 41      No       No              No   Male      No         Yes      8
#> 42     Yes       No              No   Male      No         Yes     10
#> 43      No      Yes              No   Male      No         Yes      4
#> 44     Yes      Yes              No   Male      No         Yes      3
#> 45      No       No             Yes   Male      No         Yes      4
#> 46     Yes       No             Yes   Male      No         Yes      3
#> 47      No      Yes             Yes   Male      No         Yes      3
#> 48     Yes      Yes             Yes   Male      No         Yes      0
#> 49      No       No              No Female     Yes         Yes      0
#> 50     Yes       No              No Female     Yes         Yes      0
#> 51      No      Yes              No Female     Yes         Yes      0
#> 52     Yes      Yes              No Female     Yes         Yes      1
#> 53      No       No             Yes Female     Yes         Yes      0
#> 54     Yes       No             Yes Female     Yes         Yes      1
#> 55      No      Yes             Yes Female     Yes         Yes      1
#> 56     Yes      Yes             Yes Female     Yes         Yes      0
#> 57      No       No              No   Male     Yes         Yes      6
#> 58     Yes       No              No   Male     Yes         Yes      3
#> 59      No      Yes              No   Male     Yes         Yes      4
#> 60     Yes      Yes              No   Male     Yes         Yes      2
#> 61      No       No             Yes   Male     Yes         Yes      3
#> 62     Yes       No             Yes   Male     Yes         Yes      5
#> 63      No      Yes             Yes   Male     Yes         Yes      4
#> 64     Yes      Yes             Yes   Male     Yes         Yes      0

# Collaboration with dplyr
heartfailure %>%
  diagnose_sparese(type = "all") %>%
  arrange(desc(n_case)) %>%
  mutate(percent = round(n_case / sum(n_case) * 100, 1))
#>    anaemia diabetes hblood_pressure    sex smoking death_event n_case percent
#> 1       No       No              No   Male     Yes          No     26     8.7
#> 2       No      Yes              No Female      No          No     14     4.7
#> 3       No      Yes              No   Male      No          No     13     4.3
#> 4       No       No              No   Male      No          No     12     4.0
#> 5      Yes       No              No   Male     Yes          No     12     4.0
#> 6      Yes       No              No Female      No          No     11     3.7
#> 7      Yes      Yes              No   Male      No          No     11     3.7
#> 8       No       No              No Female      No          No     10     3.3
#> 9      Yes       No              No   Male      No         Yes     10     3.3
#> 10     Yes      Yes              No Female      No          No      9     3.0
#> 11      No       No             Yes Female      No          No      8     2.7
#> 12      No       No             Yes   Male      No          No      8     2.7
#> 13     Yes       No             Yes   Male      No          No      8     2.7
#> 14      No      Yes              No   Male     Yes          No      8     2.7
#> 15      No       No              No   Male      No         Yes      8     2.7
#> 16     Yes       No              No   Male      No          No      7     2.3
#> 17     Yes       No             Yes Female      No          No      6     2.0
#> 18      No      Yes             Yes Female      No          No      6     2.0
#> 19     Yes      Yes             Yes Female      No          No      6     2.0
#> 20     Yes      Yes              No Female      No         Yes      6     2.0
#> 21     Yes      Yes             Yes Female      No         Yes      6     2.0
#> 22      No       No              No   Male     Yes         Yes      6     2.0
#> 23      No      Yes             Yes   Male      No          No      5     1.7
#> 24      No       No             Yes   Male     Yes          No      5     1.7
#> 25      No       No              No Female      No         Yes      5     1.7
#> 26     Yes       No             Yes   Male     Yes         Yes      5     1.7
#> 27     Yes      Yes              No   Male     Yes          No      4     1.3
#> 28     Yes       No             Yes   Male     Yes          No      4     1.3
#> 29      No      Yes             Yes   Male     Yes          No      4     1.3
#> 30     Yes       No             Yes Female      No         Yes      4     1.3
#> 31      No      Yes              No   Male      No         Yes      4     1.3
#> 32      No       No             Yes   Male      No         Yes      4     1.3
#> 33      No      Yes              No   Male     Yes         Yes      4     1.3
#> 34      No      Yes             Yes   Male     Yes         Yes      4     1.3
#> 35     Yes      Yes             Yes   Male      No          No      3     1.0
#> 36      No      Yes              No Female      No         Yes      3     1.0
#> 37      No      Yes             Yes Female      No         Yes      3     1.0
#> 38     Yes      Yes              No   Male      No         Yes      3     1.0
#> 39     Yes       No             Yes   Male      No         Yes      3     1.0
#> 40      No      Yes             Yes   Male      No         Yes      3     1.0
#> 41     Yes       No              No   Male     Yes         Yes      3     1.0
#> 42      No       No             Yes   Male     Yes         Yes      3     1.0
#> 43     Yes      Yes             Yes   Male     Yes          No      2     0.7
#> 44     Yes       No              No Female      No         Yes      2     0.7
#> 45      No       No             Yes Female      No         Yes      2     0.7
#> 46     Yes      Yes              No   Male     Yes         Yes      2     0.7
#> 47      No       No             Yes Female     Yes          No      1     0.3
#> 48     Yes      Yes              No Female     Yes         Yes      1     0.3
#> 49     Yes       No             Yes Female     Yes         Yes      1     0.3
#> 50      No      Yes             Yes Female     Yes         Yes      1     0.3
#> 51      No       No              No Female     Yes          No      0     0.0
#> 52     Yes       No              No Female     Yes          No      0     0.0
#> 53      No      Yes              No Female     Yes          No      0     0.0
#> 54     Yes      Yes              No Female     Yes          No      0     0.0
#> 55     Yes       No             Yes Female     Yes          No      0     0.0
#> 56      No      Yes             Yes Female     Yes          No      0     0.0
#> 57     Yes      Yes             Yes Female     Yes          No      0     0.0
#> 58     Yes      Yes             Yes   Male      No         Yes      0     0.0
#> 59      No       No              No Female     Yes         Yes      0     0.0
#> 60     Yes       No              No Female     Yes         Yes      0     0.0
#> 61      No      Yes              No Female     Yes         Yes      0     0.0
#> 62      No       No             Yes Female     Yes         Yes      0     0.0
#> 63     Yes      Yes             Yes Female     Yes         Yes      0     0.0
#> 64     Yes      Yes             Yes   Male     Yes         Yes      0     0.0
# }