diagnose_category() produces information for diagnosing the quality of the categorical variables of a data.frame or tbl_df.
a data.frame, a tbl_df, or a grouped_df.
one or more unquoted expressions separated by commas. You can treat variable names as if they were positions. Positive values select variables; negative values drop variables. If the first expression is negative, diagnose_category() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.
an integer. Specifies the number of top rows or the top rank to extract. Default is 10.
a character string specifying how results are extracted. "rank" extracts the top n ranks by decreasing frequency; in this case, if there are ties in rank, more rows than the number specified by the top argument may be returned. The default, "n", extracts only the top n rows by decreasing frequency. If too many rows are returned because of ties, use "n" to limit the number of returned rows.
logical. Decide whether to include text variables in the diagnosis of categorical data. The default value is TRUE, which also includes character variables.
logical. Decide whether to include Date and POSIXct variables in the diagnosis of categorical data. The default value is TRUE, which also includes Date and POSIXct variables.
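The difference between the two values of the type argument can be illustrated with a small base R sketch. It does not call diagnose_category(); the frequency vector freq and the helper names are hypothetical, and the sketch assumes that ties share the smallest rank, as the example output for enrollee_id suggests.

```r
# Hypothetical level frequencies with a three-way tie (not from jobchange).
freq <- c(a = 5L, b = 3L, c = 3L, d = 3L, e = 1L)
freq <- sort(freq, decreasing = TRUE)

# Assumption: tied levels share the smallest rank.
rnk <- rank(-freq, ties.method = "min")  # a = 1, b = c = d = 2, e = 5

top <- 2

# Analogue of type = "n": keep only the first `top` rows.
top_n_rows <- head(freq, top)

# Analogue of type = "rank": keep every row whose rank is <= top,
# so tied levels can push the result past `top` rows.
top_rank_rows <- freq[rnk <= top]

length(top_n_rows)
length(top_rank_rows)
```

With these toy frequencies the "n" analogue keeps two rows, while the "rank" analogue keeps four because three levels are tied at rank 2.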
an object of class tbl_df.
The scope of the diagnosis is the occupancy status of the levels in categorical data. If the occupancy ratio of a single level is close to 100%, consider removing the variable from the predictive model. Conversely, if the occupancy ratio of every level is close to 0%, the variable is likely to be an identifier.
The information derived from the categorical data diagnosis is as follows.
variables : variable names
levels : level names
N : number of observations
freq : number of observations at each level
ratio : percentage of observations at each level
rank : rank of the occupancy ratio of the levels
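As a rough illustration of how these columns relate to one another, the following base R sketch builds the same six columns for a single toy vector. It is not the package's implementation: the vector x and the data frame diag are hypothetical, and the tie handling (ties share the smallest rank) is an assumption inferred from the example output.

```r
# Hypothetical categorical variable with one missing value.
x <- c("a", "a", "a", "b", "b", "c", NA)

# Count every level, keeping NA as a level of its own.
tab <- table(x, useNA = "ifany")

diag <- data.frame(
  variables = "x",
  levels    = names(tab),
  N         = length(x),        # total number of observations
  freq      = as.integer(tab),  # observations at each level
  stringsAsFactors = FALSE
)

# ratio is the share of each level, expressed as a percentage of N.
diag$ratio <- diag$freq / diag$N * 100

# Assumption: rank by decreasing frequency, ties share the smallest rank.
diag$rank <- rank(-diag$freq, ties.method = "min")

# Order rows by decreasing frequency, as in the example output.
diag <- diag[order(-diag$freq), ]
diag
```

Because every observation falls into exactly one level, the ratio column sums to 100 within a variable.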
See vignette("diagonosis") for an introduction to these concepts.
# \donttest{
# Diagnosis of categorical variables
diagnose_category(jobchange)
#> variables levels N freq ratio rank
#> 1 enrollee_id 1 19158 1 0.005219752 1
#> 2 enrollee_id 10 19158 1 0.005219752 1
#> 3 enrollee_id 10000 19158 1 0.005219752 1
#> 4 enrollee_id 10001 19158 1 0.005219752 1
#> 5 enrollee_id 10002 19158 1 0.005219752 1
#> 6 enrollee_id 10003 19158 1 0.005219752 1
#> 7 enrollee_id 10004 19158 1 0.005219752 1
#> 8 enrollee_id 10005 19158 1 0.005219752 1
#> 9 enrollee_id 10006 19158 1 0.005219752 1
#> 10 enrollee_id 10008 19158 1 0.005219752 1
#> 11 city city_103 19158 4355 22.732017956 1
#> 12 city city_21 19158 2702 14.103768661 2
#> 13 city city_16 19158 1533 8.001879111 3
#> 14 city city_114 19158 1336 6.973588057 4
#> 15 city city_160 19158 845 4.410690051 5
#> 16 city city_136 19158 586 3.058774402 6
#> 17 city city_67 19158 431 2.249712914 7
#> 18 city city_75 19158 305 1.592024220 8
#> 19 city city_102 19158 304 1.586804468 9
#> 20 city city_104 19158 301 1.571145213 10
#> 21 gender Male 19158 13221 69.010335108 1
#> 22 gender <NA> 19158 4508 23.530639942 2
#> 23 gender Female 19158 1238 6.462052406 3
#> 24 gender Other 19158 191 0.996972544 4
#> 25 relevent_experience Has relevent experience 19158 13792 71.990813237 1
#> 26 relevent_experience No relevent experience 19158 5366 28.009186763 2
#> 27 enrolled_university no_enrollment 19158 13817 72.121307026 1
#> 28 enrolled_university Full time course 19158 3757 19.610606535 2
#> 29 enrolled_university Part time course 19158 1198 6.253262345 3
#> 30 enrolled_university <NA> 19158 386 2.014824094 4
#> 31 education_level Graduate 19158 11598 60.538678359 1
#> 32 education_level Masters 19158 4361 22.763336465 2
#> 33 education_level High School 19158 2017 10.528238856 3
#> 34 education_level <NA> 19158 460 2.401085708 4
#> 35 education_level Phd 19158 414 2.160977137 5
#> 36 education_level Primary School 19158 308 1.607683474 6
#> 37 major_discipline STEM 19158 14492 75.644639315 1
#> 38 major_discipline <NA> 19158 2813 14.683161082 2
#> 39 major_discipline Humanities 19158 669 3.492013780 3
#> 40 major_discipline Other 19158 381 1.988725337 4
#> 41 major_discipline Business Degree 19158 327 1.706858754 5
#> 42 major_discipline Arts 19158 253 1.320597140 6
#> 43 major_discipline No Major 19158 223 1.164004593 7
#> 44 experience >20 19158 3286 17.152103560 1
#> 45 experience 5 19158 1430 7.464244702 2
#> 46 experience 4 19158 1403 7.323311410 3
#> 47 experience 3 19158 1354 7.067543585 4
#> 48 experience 6 19158 1216 6.347217872 5
#> 49 experience 2 19158 1127 5.882659985 6
#> 50 experience 7 19158 1028 5.365904583 7
#> 51 experience 10 19158 985 5.141455267 8
#> 52 experience 9 19158 980 5.115356509 9
#> 53 experience 8 19158 802 4.186240735 10
#> 54 company_size <NA> 19158 5938 30.994884643 1
#> 55 company_size 50-99 19158 3083 16.092493997 2
#> 56 company_size 100-499 19158 2571 13.419981209 3
#> 57 company_size 10000+ 19158 2019 10.538678359 4
#> 58 company_size 10-49 19158 1471 7.678254515 5
#> 59 company_size 1000-4999 19158 1328 6.931830045 6
#> 60 company_size <10 19158 1308 6.827435014 7
#> 61 company_size 500-999 19158 877 4.577722100 8
#> 62 company_size 5000-9999 19158 563 2.938720117 9
#> 63 company_type Pvt Ltd 19158 9817 51.242300866 1
#> 64 company_type <NA> 19158 6140 32.049274455 2
#> 65 company_type Funded Startup 19158 1001 5.224971291 3
#> 66 company_type Public Sector 19158 955 4.984862721 4
#> 67 company_type Early Stage Startup 19158 603 3.147510179 5
#> 68 company_type NGO 19158 521 2.719490552 6
#> 69 company_type Other 19158 121 0.631589936 7
#> 70 last_new_job 1 19158 8040 41.966802380 1
#> 71 last_new_job >4 19158 3290 17.172982566 2
#> 72 last_new_job 2 19158 2900 15.137279465 3
#> 73 last_new_job never 19158 2452 12.798830776 4
#> 74 last_new_job 4 19158 1029 5.371124334 5
#> 75 last_new_job 3 19158 1024 5.345025577 6
#> 76 last_new_job <NA> 19158 423 2.207954901 7
#> 77 job_chnge No 19158 14381 75.065246894 1
#> 78 job_chnge Yes 19158 4777 24.934753106 2
# Select the variable to diagnose
diagnose_category(jobchange, education_level, company_type)
#> variables levels N freq ratio rank
#> 1 education_level Graduate 19158 11598 60.5386784 1
#> 2 education_level Masters 19158 4361 22.7633365 2
#> 3 education_level High School 19158 2017 10.5282389 3
#> 4 education_level <NA> 19158 460 2.4010857 4
#> 5 education_level Phd 19158 414 2.1609771 5
#> 6 education_level Primary School 19158 308 1.6076835 6
#> 7 company_type Pvt Ltd 19158 9817 51.2423009 1
#> 8 company_type <NA> 19158 6140 32.0492745 2
#> 9 company_type Funded Startup 19158 1001 5.2249713 3
#> 10 company_type Public Sector 19158 955 4.9848627 4
#> 11 company_type Early Stage Startup 19158 603 3.1475102 5
#> 12 company_type NGO 19158 521 2.7194906 6
#> 13 company_type Other 19158 121 0.6315899 7
# Using pipes ---------------------------------
library(dplyr)
# Diagnosis of all categorical variables
jobchange %>%
  diagnose_category()
#> variables levels N freq ratio rank
#> 1 enrollee_id 1 19158 1 0.005219752 1
#> 2 enrollee_id 10 19158 1 0.005219752 1
#> 3 enrollee_id 10000 19158 1 0.005219752 1
#> 4 enrollee_id 10001 19158 1 0.005219752 1
#> 5 enrollee_id 10002 19158 1 0.005219752 1
#> 6 enrollee_id 10003 19158 1 0.005219752 1
#> 7 enrollee_id 10004 19158 1 0.005219752 1
#> 8 enrollee_id 10005 19158 1 0.005219752 1
#> 9 enrollee_id 10006 19158 1 0.005219752 1
#> 10 enrollee_id 10008 19158 1 0.005219752 1
#> 11 city city_103 19158 4355 22.732017956 1
#> 12 city city_21 19158 2702 14.103768661 2
#> 13 city city_16 19158 1533 8.001879111 3
#> 14 city city_114 19158 1336 6.973588057 4
#> 15 city city_160 19158 845 4.410690051 5
#> 16 city city_136 19158 586 3.058774402 6
#> 17 city city_67 19158 431 2.249712914 7
#> 18 city city_75 19158 305 1.592024220 8
#> 19 city city_102 19158 304 1.586804468 9
#> 20 city city_104 19158 301 1.571145213 10
#> 21 gender Male 19158 13221 69.010335108 1
#> 22 gender <NA> 19158 4508 23.530639942 2
#> 23 gender Female 19158 1238 6.462052406 3
#> 24 gender Other 19158 191 0.996972544 4
#> 25 relevent_experience Has relevent experience 19158 13792 71.990813237 1
#> 26 relevent_experience No relevent experience 19158 5366 28.009186763 2
#> 27 enrolled_university no_enrollment 19158 13817 72.121307026 1
#> 28 enrolled_university Full time course 19158 3757 19.610606535 2
#> 29 enrolled_university Part time course 19158 1198 6.253262345 3
#> 30 enrolled_university <NA> 19158 386 2.014824094 4
#> 31 education_level Graduate 19158 11598 60.538678359 1
#> 32 education_level Masters 19158 4361 22.763336465 2
#> 33 education_level High School 19158 2017 10.528238856 3
#> 34 education_level <NA> 19158 460 2.401085708 4
#> 35 education_level Phd 19158 414 2.160977137 5
#> 36 education_level Primary School 19158 308 1.607683474 6
#> 37 major_discipline STEM 19158 14492 75.644639315 1
#> 38 major_discipline <NA> 19158 2813 14.683161082 2
#> 39 major_discipline Humanities 19158 669 3.492013780 3
#> 40 major_discipline Other 19158 381 1.988725337 4
#> 41 major_discipline Business Degree 19158 327 1.706858754 5
#> 42 major_discipline Arts 19158 253 1.320597140 6
#> 43 major_discipline No Major 19158 223 1.164004593 7
#> 44 experience >20 19158 3286 17.152103560 1
#> 45 experience 5 19158 1430 7.464244702 2
#> 46 experience 4 19158 1403 7.323311410 3
#> 47 experience 3 19158 1354 7.067543585 4
#> 48 experience 6 19158 1216 6.347217872 5
#> 49 experience 2 19158 1127 5.882659985 6
#> 50 experience 7 19158 1028 5.365904583 7
#> 51 experience 10 19158 985 5.141455267 8
#> 52 experience 9 19158 980 5.115356509 9
#> 53 experience 8 19158 802 4.186240735 10
#> 54 company_size <NA> 19158 5938 30.994884643 1
#> 55 company_size 50-99 19158 3083 16.092493997 2
#> 56 company_size 100-499 19158 2571 13.419981209 3
#> 57 company_size 10000+ 19158 2019 10.538678359 4
#> 58 company_size 10-49 19158 1471 7.678254515 5
#> 59 company_size 1000-4999 19158 1328 6.931830045 6
#> 60 company_size <10 19158 1308 6.827435014 7
#> 61 company_size 500-999 19158 877 4.577722100 8
#> 62 company_size 5000-9999 19158 563 2.938720117 9
#> 63 company_type Pvt Ltd 19158 9817 51.242300866 1
#> 64 company_type <NA> 19158 6140 32.049274455 2
#> 65 company_type Funded Startup 19158 1001 5.224971291 3
#> 66 company_type Public Sector 19158 955 4.984862721 4
#> 67 company_type Early Stage Startup 19158 603 3.147510179 5
#> 68 company_type NGO 19158 521 2.719490552 6
#> 69 company_type Other 19158 121 0.631589936 7
#> 70 last_new_job 1 19158 8040 41.966802380 1
#> 71 last_new_job >4 19158 3290 17.172982566 2
#> 72 last_new_job 2 19158 2900 15.137279465 3
#> 73 last_new_job never 19158 2452 12.798830776 4
#> 74 last_new_job 4 19158 1029 5.371124334 5
#> 75 last_new_job 3 19158 1024 5.345025577 6
#> 76 last_new_job <NA> 19158 423 2.207954901 7
#> 77 job_chnge No 19158 14381 75.065246894 1
#> 78 job_chnge Yes 19158 4777 24.934753106 2
# Positive values select variables
jobchange %>%
  diagnose_category(company_type, job_chnge)
#> variables levels N freq ratio rank
#> 1 company_type Pvt Ltd 19158 9817 51.2423009 1
#> 2 company_type <NA> 19158 6140 32.0492745 2
#> 3 company_type Funded Startup 19158 1001 5.2249713 3
#> 4 company_type Public Sector 19158 955 4.9848627 4
#> 5 company_type Early Stage Startup 19158 603 3.1475102 5
#> 6 company_type NGO 19158 521 2.7194906 6
#> 7 company_type Other 19158 121 0.6315899 7
#> 8 job_chnge No 19158 14381 75.0652469 1
#> 9 job_chnge Yes 19158 4777 24.9347531 2
# Negative values drop variables
jobchange %>%
  diagnose_category(-company_type, -job_chnge)
#> variables levels N freq ratio rank
#> 1 enrollee_id 1 19158 1 0.005219752 1
#> 2 enrollee_id 10 19158 1 0.005219752 1
#> 3 enrollee_id 10000 19158 1 0.005219752 1
#> 4 enrollee_id 10001 19158 1 0.005219752 1
#> 5 enrollee_id 10002 19158 1 0.005219752 1
#> 6 enrollee_id 10003 19158 1 0.005219752 1
#> 7 enrollee_id 10004 19158 1 0.005219752 1
#> 8 enrollee_id 10005 19158 1 0.005219752 1
#> 9 enrollee_id 10006 19158 1 0.005219752 1
#> 10 enrollee_id 10008 19158 1 0.005219752 1
#> 11 city city_103 19158 4355 22.732017956 1
#> 12 city city_21 19158 2702 14.103768661 2
#> 13 city city_16 19158 1533 8.001879111 3
#> 14 city city_114 19158 1336 6.973588057 4
#> 15 city city_160 19158 845 4.410690051 5
#> 16 city city_136 19158 586 3.058774402 6
#> 17 city city_67 19158 431 2.249712914 7
#> 18 city city_75 19158 305 1.592024220 8
#> 19 city city_102 19158 304 1.586804468 9
#> 20 city city_104 19158 301 1.571145213 10
#> 21 gender Male 19158 13221 69.010335108 1
#> 22 gender <NA> 19158 4508 23.530639942 2
#> 23 gender Female 19158 1238 6.462052406 3
#> 24 gender Other 19158 191 0.996972544 4
#> 25 relevent_experience Has relevent experience 19158 13792 71.990813237 1
#> 26 relevent_experience No relevent experience 19158 5366 28.009186763 2
#> 27 enrolled_university no_enrollment 19158 13817 72.121307026 1
#> 28 enrolled_university Full time course 19158 3757 19.610606535 2
#> 29 enrolled_university Part time course 19158 1198 6.253262345 3
#> 30 enrolled_university <NA> 19158 386 2.014824094 4
#> 31 education_level Graduate 19158 11598 60.538678359 1
#> 32 education_level Masters 19158 4361 22.763336465 2
#> 33 education_level High School 19158 2017 10.528238856 3
#> 34 education_level <NA> 19158 460 2.401085708 4
#> 35 education_level Phd 19158 414 2.160977137 5
#> 36 education_level Primary School 19158 308 1.607683474 6
#> 37 major_discipline STEM 19158 14492 75.644639315 1
#> 38 major_discipline <NA> 19158 2813 14.683161082 2
#> 39 major_discipline Humanities 19158 669 3.492013780 3
#> 40 major_discipline Other 19158 381 1.988725337 4
#> 41 major_discipline Business Degree 19158 327 1.706858754 5
#> 42 major_discipline Arts 19158 253 1.320597140 6
#> 43 major_discipline No Major 19158 223 1.164004593 7
#> 44 experience >20 19158 3286 17.152103560 1
#> 45 experience 5 19158 1430 7.464244702 2
#> 46 experience 4 19158 1403 7.323311410 3
#> 47 experience 3 19158 1354 7.067543585 4
#> 48 experience 6 19158 1216 6.347217872 5
#> 49 experience 2 19158 1127 5.882659985 6
#> 50 experience 7 19158 1028 5.365904583 7
#> 51 experience 10 19158 985 5.141455267 8
#> 52 experience 9 19158 980 5.115356509 9
#> 53 experience 8 19158 802 4.186240735 10
#> 54 company_size <NA> 19158 5938 30.994884643 1
#> 55 company_size 50-99 19158 3083 16.092493997 2
#> 56 company_size 100-499 19158 2571 13.419981209 3
#> 57 company_size 10000+ 19158 2019 10.538678359 4
#> 58 company_size 10-49 19158 1471 7.678254515 5
#> 59 company_size 1000-4999 19158 1328 6.931830045 6
#> 60 company_size <10 19158 1308 6.827435014 7
#> 61 company_size 500-999 19158 877 4.577722100 8
#> 62 company_size 5000-9999 19158 563 2.938720117 9
#> 63 last_new_job 1 19158 8040 41.966802380 1
#> 64 last_new_job >4 19158 3290 17.172982566 2
#> 65 last_new_job 2 19158 2900 15.137279465 3
#> 66 last_new_job never 19158 2452 12.798830776 4
#> 67 last_new_job 4 19158 1029 5.371124334 5
#> 68 last_new_job 3 19158 1024 5.345025577 6
#> 69 last_new_job <NA> 19158 423 2.207954901 7
# Top rank levels with top argument
jobchange %>%
  diagnose_category(top = 2)
#> variables levels N freq ratio rank
#> 1 enrollee_id 1 19158 1 0.005219752 1
#> 2 enrollee_id 10 19158 1 0.005219752 1
#> 3 city city_103 19158 4355 22.732017956 1
#> 4 city city_21 19158 2702 14.103768661 2
#> 5 gender Male 19158 13221 69.010335108 1
#> 6 gender <NA> 19158 4508 23.530639942 2
#> 7 relevent_experience Has relevent experience 19158 13792 71.990813237 1
#> 8 relevent_experience No relevent experience 19158 5366 28.009186763 2
#> 9 enrolled_university no_enrollment 19158 13817 72.121307026 1
#> 10 enrolled_university Full time course 19158 3757 19.610606535 2
#> 11 education_level Graduate 19158 11598 60.538678359 1
#> 12 education_level Masters 19158 4361 22.763336465 2
#> 13 major_discipline STEM 19158 14492 75.644639315 1
#> 14 major_discipline <NA> 19158 2813 14.683161082 2
#> 15 experience >20 19158 3286 17.152103560 1
#> 16 experience 5 19158 1430 7.464244702 2
#> 17 company_size <NA> 19158 5938 30.994884643 1
#> 18 company_size 50-99 19158 3083 16.092493997 2
#> 19 company_type Pvt Ltd 19158 9817 51.242300866 1
#> 20 company_type <NA> 19158 6140 32.049274455 2
#> 21 last_new_job 1 19158 8040 41.966802380 1
#> 22 last_new_job >4 19158 3290 17.172982566 2
#> 23 job_chnge No 19158 14381 75.065246894 1
#> 24 job_chnge Yes 19158 4777 24.934753106 2
# Using pipes & dplyr -------------------------
# Extract levels that account for at least 60% of the observations
jobchange %>%
  diagnose_category() %>%
  filter(ratio >= 60)
#> variables levels N freq ratio rank
#> 1 gender Male 19158 13221 69.01034 1
#> 2 relevent_experience Has relevent experience 19158 13792 71.99081 1
#> 3 enrolled_university no_enrollment 19158 13817 72.12131 1
#> 4 education_level Graduate 19158 11598 60.53868 1
#> 5 major_discipline STEM 19158 14492 75.64464 1
#> 6 job_chnge No 19158 14381 75.06525 1
# All observations of enrollee_id have a rank of 1
# because it is a unique identifier. Therefore, if you select up to the top rank 3,
# all records are displayed. It will probably fill your screen.
# Extract only the top 3 rows; the default of the type argument is "n"
jobchange %>%
  diagnose_category(enrollee_id, top = 3)
#> # A tibble: 3 × 6
#> variables levels N freq ratio rank
#> <chr> <chr> <int> <int> <dbl> <int>
#> 1 enrollee_id 1 19158 1 0.00522 1
#> 2 enrollee_id 10 19158 1 0.00522 1
#> 3 enrollee_id 10000 19158 1 0.00522 1
# Extract rows whose rank is less than or equal to 3
jobchange %>%
  diagnose_category(enrollee_id, top = 3, type = "rank")
#> # A tibble: 19,158 × 6
#> variables levels N freq ratio rank
#> <chr> <chr> <int> <int> <dbl> <int>
#> 1 enrollee_id 1 19158 1 0.00522 1
#> 2 enrollee_id 10 19158 1 0.00522 1
#> 3 enrollee_id 10000 19158 1 0.00522 1
#> 4 enrollee_id 10001 19158 1 0.00522 1
#> 5 enrollee_id 10002 19158 1 0.00522 1
#> 6 enrollee_id 10003 19158 1 0.00522 1
#> 7 enrollee_id 10004 19158 1 0.00522 1
#> 8 enrollee_id 10005 19158 1 0.00522 1
#> 9 enrollee_id 10006 19158 1 0.00522 1
#> 10 enrollee_id 10008 19158 1 0.00522 1
#> # ℹ 19,148 more rows
# Extract only 3 rows
jobchange %>%
  diagnose_category(enrollee_id, top = 3, type = "n")
#> # A tibble: 3 × 6
#> variables levels N freq ratio rank
#> <chr> <chr> <int> <int> <dbl> <int>
#> 1 enrollee_id 1 19158 1 0.00522 1
#> 2 enrollee_id 10 19158 1 0.00522 1
#> 3 enrollee_id 10000 19158 1 0.00522 1
# Using group_by ------------------------------
# Calculate the diagnosis of 'company_type' variable by 'job_chnge' using group_by()
jobchange %>%
  group_by(job_chnge) %>%
  diagnose_category(company_type)
#> # A tibble: 14 × 7
#> variables job_chnge levels N freq ratio rank
#> <chr> <fct> <chr> <int> <int> <dbl> <int>
#> 1 company_type No Pvt Ltd 14381 8042 55.9 1
#> 2 company_type No NA 14381 3756 26.1 2
#> 3 company_type No Funded Startup 14381 861 5.99 3
#> 4 company_type No Public Sector 14381 745 5.18 4
#> 5 company_type No Early Stage Startup 14381 461 3.21 5
#> 6 company_type No NGO 14381 424 2.95 6
#> 7 company_type No Other 14381 92 0.640 7
#> 8 company_type Yes NA 4777 2384 49.9 1
#> 9 company_type Yes Pvt Ltd 4777 1775 37.2 2
#> 10 company_type Yes Public Sector 4777 210 4.40 3
#> 11 company_type Yes Early Stage Startup 4777 142 2.97 4
#> 12 company_type Yes Funded Startup 4777 140 2.93 5
#> 13 company_type Yes NGO 4777 97 2.03 6
#> 14 company_type Yes Other 4777 29 0.607 7
# }