diagnose_category() produces information for diagnosing the quality of the categorical variables of a data.frame or tbl_df.

diagnose_category(.data, ...)

# S3 method for data.frame
diagnose_category(
  .data,
  ...,
  top = 10,
  type = c("rank", "n")[2],
  add_character = TRUE,
  add_date = TRUE
)

Arguments

.data

a data.frame or a tbl_df.

...

one or more unquoted expressions separated by commas. You can treat variable names as if they were positions. Positive values select variables; negative values drop variables. If the first expression is negative, diagnose_category() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.

top

an integer. Specifies the number of top rows or ranks to extract. Default is 10.

type

a character string specifying how results are extracted. "rank" extracts the top n ranks by decreasing frequency; if there are ties in rank, more rows than the number specified by the top argument are returned. The default, "n", extracts only the top n rows by decreasing frequency. If "rank" returns too many rows because of ties, use "n" to limit the number of rows returned.

add_character

logical. Decide whether to include character variables in the diagnosis of categorical data. The default value is TRUE, which includes character variables.

add_date

logical. Decide whether to include Date and POSIXct variables in the diagnosis of categorical data. The default value is TRUE, which includes Date and POSIXct variables.
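
The difference between the two type values can be sketched in base R. This is only an illustration: the frequencies are made up, and giving tied levels the smallest shared rank is an assumption inferred from the example output, not a statement about how diagnose_category() is implemented.

```r
# Hypothetical level frequencies with a three-way tie for second place.
freq <- c(a = 50, b = 20, c = 20, d = 20, e = 5)

# Order by decreasing frequency; tied levels share the smallest rank
# (assumption: this mirrors the rank column in the examples).
freq <- sort(freq, decreasing = TRUE)
rnk  <- rank(-freq, ties.method = "min")

top <- 2

# type = "n": keep exactly the first `top` rows.
by_n <- head(freq, top)

# type = "rank": keep every row whose rank is at most `top`.
by_rank <- freq[rnk <= top]

print(length(by_n))     # two levels
print(length(by_rank))  # four levels, because the three tied levels all have rank 2
```

With ties, "rank" can return more rows than top, while "n" always stops at top rows.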

Value

an object of class tbl_df.

Details

The diagnosis covers how observations are distributed across the levels of categorical data. If the occupancy ratio of a single level is close to 100%, consider removing the variable from the forecast model. Conversely, if the occupancy ratio of every level is close to 0%, the variable is likely to be an identifier.
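
The two warning signs described above can be illustrated with a small base R sketch. The toy data frame, its column names, and the 95% and 1% thresholds are all assumptions chosen for illustration; they are not values used by the package.

```r
# Toy data; column names are hypothetical, not taken from jobchange.
df <- data.frame(
  id      = sprintf("u%03d", 1:200),                # unique per row
  country = c(rep("KR", 198), "US", "JP"),          # one dominant level
  grade   = rep(c("A", "B", "C", "D"), each = 50),  # balanced levels
  stringsAsFactors = FALSE
)

# Largest level share (in percent) for each column.
max_ratio <- vapply(
  df,
  function(x) max(table(x, useNA = "ifany")) / length(x) * 100,
  numeric(1)
)

# Thresholds below are illustrative assumptions.
near_constant   <- names(max_ratio)[max_ratio >= 95]  # one level dominates
identifier_like <- names(max_ratio)[max_ratio <= 1]   # every level is rare

print(round(max_ratio, 1))
print(near_constant)
print(identifier_like)
```

Here country is flagged as nearly constant and id as identifier-like, while grade passes both checks.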

Categorical diagnostic information

The information derived from the categorical data diagnosis is as follows.

  • variables : variable names

  • levels: level names

  • N : number of observations

  • freq : number of observations at the level

  • ratio : percentage of observations at the level

  • rank : rank of occupancy ratio of levels

See vignette("diagonosis") for an introduction to these concepts.
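
As a rough sketch of what each column means, the same six fields can be derived for a single vector in base R. The vector and the variable name "education" are hypothetical, and the ranking rule (ties share the smallest rank) is an assumption based on the example output rather than the package's documented behavior.

```r
# Hypothetical categorical vector, including one missing value.
x <- c("Graduate", "Masters", "Graduate", NA,
       "Phd", "Graduate", "Masters", "Graduate")

# Count each level (NA counted as its own level), most frequent first.
tab <- sort(table(x, useNA = "ifany"), decreasing = TRUE)

result <- data.frame(
  variables = "education",                        # variable name
  levels    = names(tab),                         # level names
  N         = length(x),                          # number of observations
  freq      = as.integer(tab),                    # observations at each level
  ratio     = as.integer(tab) / length(x) * 100,  # percentage at each level
  stringsAsFactors = FALSE
)

# Rank by decreasing frequency; tied levels share the smallest rank.
result$rank <- rank(-result$freq, ties.method = "min")

print(result)
```

The freq column sums to N, and the ratio column sums to 100.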

Examples

# \donttest{
# Diagnosis of categorical variables
diagnose_category(jobchange)
#> # A tibble: 78 x 6
#>    variables   levels     N  freq   ratio  rank
#>    <chr>       <chr>  <int> <int>   <dbl> <int>
#>  1 enrollee_id 1      19158     1 0.00522     1
#>  2 enrollee_id 10     19158     1 0.00522     1
#>  3 enrollee_id 10000  19158     1 0.00522     1
#>  4 enrollee_id 10001  19158     1 0.00522     1
#>  5 enrollee_id 10002  19158     1 0.00522     1
#>  6 enrollee_id 10003  19158     1 0.00522     1
#>  7 enrollee_id 10004  19158     1 0.00522     1
#>  8 enrollee_id 10005  19158     1 0.00522     1
#>  9 enrollee_id 10006  19158     1 0.00522     1
#> 10 enrollee_id 10008  19158     1 0.00522     1
#> # … with 68 more rows

# Select the variable to diagnose
# diagnose_category(jobchange, education_level, company_type)
# diagnose_category(jobchange, -education_level, -company_type)
# diagnose_category(jobchange, "education_level", "company_type")
# diagnose_category(jobchange, 7)

# Using pipes ---------------------------------
library(dplyr)

# Diagnosis of all categorical variables
jobchange %>%
  diagnose_category()
#> # A tibble: 78 x 6
#>    variables   levels     N  freq   ratio  rank
#>    <chr>       <chr>  <int> <int>   <dbl> <int>
#>  1 enrollee_id 1      19158     1 0.00522     1
#>  2 enrollee_id 10     19158     1 0.00522     1
#>  3 enrollee_id 10000  19158     1 0.00522     1
#>  4 enrollee_id 10001  19158     1 0.00522     1
#>  5 enrollee_id 10002  19158     1 0.00522     1
#>  6 enrollee_id 10003  19158     1 0.00522     1
#>  7 enrollee_id 10004  19158     1 0.00522     1
#>  8 enrollee_id 10005  19158     1 0.00522     1
#>  9 enrollee_id 10006  19158     1 0.00522     1
#> 10 enrollee_id 10008  19158     1 0.00522     1
#> # … with 68 more rows

# Positive values select variables
jobchange %>%
  diagnose_category(company_type, job_chnge)
#> # A tibble: 9 x 6
#>   variables    levels                  N  freq  ratio  rank
#>   <chr>        <chr>               <int> <int>  <dbl> <int>
#> 1 company_type Pvt Ltd             19158  9817 51.2       1
#> 2 company_type NA                  19158  6140 32.0       2
#> 3 company_type Funded Startup      19158  1001  5.22      3
#> 4 company_type Public Sector       19158   955  4.98      4
#> 5 company_type Early Stage Startup 19158   603  3.15      5
#> 6 company_type NGO                 19158   521  2.72      6
#> 7 company_type Other               19158   121  0.632     7
#> 8 job_chnge    No                  19158 14381 75.1       1
#> 9 job_chnge    Yes                 19158  4777 24.9       2

# Negative values drop variables
jobchange %>%
  diagnose_category(-company_type, -job_chnge)
#> # A tibble: 69 x 6
#>    variables   levels     N  freq   ratio  rank
#>    <chr>       <chr>  <int> <int>   <dbl> <int>
#>  1 enrollee_id 1      19158     1 0.00522     1
#>  2 enrollee_id 10     19158     1 0.00522     1
#>  3 enrollee_id 10000  19158     1 0.00522     1
#>  4 enrollee_id 10001  19158     1 0.00522     1
#>  5 enrollee_id 10002  19158     1 0.00522     1
#>  6 enrollee_id 10003  19158     1 0.00522     1
#>  7 enrollee_id 10004  19158     1 0.00522     1
#>  8 enrollee_id 10005  19158     1 0.00522     1
#>  9 enrollee_id 10006  19158     1 0.00522     1
#> 10 enrollee_id 10008  19158     1 0.00522     1
#> # … with 59 more rows

# Positional values select variables
jobchange %>%
  diagnose_category(7)
#> # A tibble: 6 x 6
#>   variables       levels             N  freq ratio  rank
#>   <chr>           <chr>          <int> <int> <dbl> <int>
#> 1 education_level Graduate       19158 11598 60.5      1
#> 2 education_level Masters        19158  4361 22.8      2
#> 3 education_level High School    19158  2017 10.5      3
#> 4 education_level NA             19158   460  2.40     4
#> 5 education_level Phd            19158   414  2.16     5
#> 6 education_level Primary School 19158   308  1.61     6

# Negative positional values drop variables
jobchange %>%
  diagnose_category(-7)
#> # A tibble: 72 x 6
#>    variables   levels     N  freq   ratio  rank
#>    <chr>       <chr>  <int> <int>   <dbl> <int>
#>  1 enrollee_id 1      19158     1 0.00522     1
#>  2 enrollee_id 10     19158     1 0.00522     1
#>  3 enrollee_id 10000  19158     1 0.00522     1
#>  4 enrollee_id 10001  19158     1 0.00522     1
#>  5 enrollee_id 10002  19158     1 0.00522     1
#>  6 enrollee_id 10003  19158     1 0.00522     1
#>  7 enrollee_id 10004  19158     1 0.00522     1
#>  8 enrollee_id 10005  19158     1 0.00522     1
#>  9 enrollee_id 10006  19158     1 0.00522     1
#> 10 enrollee_id 10008  19158     1 0.00522     1
#> # … with 62 more rows

# Top rank levels with top argument
jobchange %>%
  diagnose_category(top = 2)
#> # A tibble: 24 x 6
#>    variables           levels                      N  freq    ratio  rank
#>    <chr>               <chr>                   <int> <int>    <dbl> <int>
#>  1 enrollee_id         1                       19158     1  0.00522     1
#>  2 enrollee_id         10                      19158     1  0.00522     1
#>  3 city                city_103                19158  4355 22.7         1
#>  4 city                city_21                 19158  2702 14.1         2
#>  5 gender              Male                    19158 13221 69.0         1
#>  6 gender              NA                      19158  4508 23.5         2
#>  7 relevent_experience Has relevent experience 19158 13792 72.0         1
#>  8 relevent_experience No relevent experience  19158  5366 28.0         2
#>  9 enrolled_university no_enrollment           19158 13817 72.1         1
#> 10 enrolled_university Full time course        19158  3757 19.6         2
#> # … with 14 more rows

# Using pipes & dplyr -------------------------
# Extract levels that account for 60% or more of a categorical variable
jobchange %>%
  diagnose_category() %>%
  filter(ratio >= 60)
#> # A tibble: 6 x 6
#>   variables           levels                      N  freq ratio  rank
#>   <chr>               <chr>                   <int> <int> <dbl> <int>
#> 1 gender              Male                    19158 13221  69.0     1
#> 2 relevent_experience Has relevent experience 19158 13792  72.0     1
#> 3 enrolled_university no_enrollment           19158 13817  72.1     1
#> 4 education_level     Graduate                19158 11598  60.5     1
#> 5 major_discipline    STEM                    19158 14492  75.6     1
#> 6 job_chnge           No                      19158 14381  75.1     1

# enrollee_id is a unique identifier, so every level has a rank of 1.
# Selecting up to rank 3 with type = "rank" therefore returns all records,
# which will probably fill your screen.

# The default of the type argument is "n", so only the top 3 rows are extracted
jobchange %>%
  diagnose_category(enrollee_id, top = 3)
#> # A tibble: 3 x 6
#>   variables   levels     N  freq   ratio  rank
#>   <chr>       <chr>  <int> <int>   <dbl> <int>
#> 1 enrollee_id 1      19158     1 0.00522     1
#> 2 enrollee_id 10     19158     1 0.00522     1
#> 3 enrollee_id 10000  19158     1 0.00522     1

# Extract rows whose rank is less than or equal to 3
jobchange %>%
  diagnose_category(enrollee_id, top = 3, type = "rank")
#> # A tibble: 19,158 x 6
#>    variables   levels     N  freq   ratio  rank
#>    <chr>       <chr>  <int> <int>   <dbl> <int>
#>  1 enrollee_id 1      19158     1 0.00522     1
#>  2 enrollee_id 10     19158     1 0.00522     1
#>  3 enrollee_id 10000  19158     1 0.00522     1
#>  4 enrollee_id 10001  19158     1 0.00522     1
#>  5 enrollee_id 10002  19158     1 0.00522     1
#>  6 enrollee_id 10003  19158     1 0.00522     1
#>  7 enrollee_id 10004  19158     1 0.00522     1
#>  8 enrollee_id 10005  19158     1 0.00522     1
#>  9 enrollee_id 10006  19158     1 0.00522     1
#> 10 enrollee_id 10008  19158     1 0.00522     1
#> # … with 19,148 more rows

# Extract only 3 rows
jobchange %>%
  diagnose_category(enrollee_id, top = 3, type = "n")
#> # A tibble: 3 x 6
#>   variables   levels     N  freq   ratio  rank
#>   <chr>       <chr>  <int> <int>   <dbl> <int>
#> 1 enrollee_id 1      19158     1 0.00522     1
#> 2 enrollee_id 10     19158     1 0.00522     1
#> 3 enrollee_id 10000  19158     1 0.00522     1
# }