R/tbl_dbi.R
diagnose_category.tbl_dbi.Rd
The diagnose_category() produces information for diagnosing the quality of the character(CHAR, VARCHAR, VARCHAR2, etc.) column of the DBMS table through tbl_dbi.
# S3 method for tbl_dbi diagnose_category( .data, ..., top = 10, type = c("rank", "n")[1], in_database = TRUE, collect_size = Inf )
.data | a tbl_dbi. |
---|---|
... | one or more unquoted expressions separated by commas. You can treat variable names like they are positions. Positive values select variables; negative values to drop variables. If the first expression is negative, diagnose_category() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing. |
top | an integer. Specifies the upper top rank to extract. Default is 10. |
type | a character string specifying how result are extracted. Default is "rank" that extract top n ranks by decreasing frequency. In this case, if there are ties in rank, more rows than the number specified by the top argument are returned. "n" extract top n rows by decreasing frequency. If there are too many rows to be returned because there are too many ties, you can adjust the returned rows appropriately by using "n". |
in_database | Specifies whether to perform in-database operations. If TRUE, most operations are performed in the DBMS. if FALSE, table data is taken in R and operated in-memory. |
collect_size | a integer. The number of data samples from the DBMS to R. Applies only if in_database = FALSE. |
an object of tbl_df.
The scope of the diagnosis is the occupancy status of the levels in categorical data. If a certain level of occupancy is close to 100 then the removal of this variable in the forecast model will have to be considered. Also, if the occupancy of all levels is close to 0 variable is likely to be an identifier.
The information derived from the categorical data diagnosis is as follows.
variables : variable names
levels: level names
N : number of observation
freq : number of observation at the levels
ratio : percentage of observation at the levels
rank : rank of occupancy ratio of levels
See vignette("diagonosis") for an introduction to these concepts.
diagnose_category.data.frame
, diagnose.tbl_dbi
, diagnose_category.tbl_dbi
, diagnose_numeric.tbl_dbi
, diagnose_outlier.tbl_dbi
.
# \donttest{ library(dplyr) # connect DBMS con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # copy jobchange to the DBMS with a table named TB_JOBCHANGE copy_to(con_sqlite, jobchange, name = "TB_JOBCHANGE", overwrite = TRUE) # Using pipes --------------------------------- # Diagnosis of all categorical variables con_sqlite %>% tbl("TB_JOBCHANGE") %>% diagnose_category()#> # A tibble: 19,226 x 6 #> variables levels N freq ratio rank #> <chr> <chr> <int> <int> <dbl> <int> #> 1 enrollee_id 9998 19158 1 0.00522 1 #> 2 enrollee_id 9995 19158 1 0.00522 1 #> 3 enrollee_id 9992 19158 1 0.00522 1 #> 4 enrollee_id 999 19158 1 0.00522 1 #> 5 enrollee_id 9989 19158 1 0.00522 1 #> 6 enrollee_id 9988 19158 1 0.00522 1 #> 7 enrollee_id 9987 19158 1 0.00522 1 #> 8 enrollee_id 9986 19158 1 0.00522 1 #> 9 enrollee_id 9985 19158 1 0.00522 1 #> 10 enrollee_id 9984 19158 1 0.00522 1 #> # … with 19,216 more rows# Positive values select variables con_sqlite %>% tbl("TB_JOBCHANGE") %>% diagnose_category(company_type, job_chnge)#> # A tibble: 9 x 6 #> variables levels N freq ratio rank #> <chr> <chr> <int> <int> <dbl> <int> #> 1 company_type Pvt Ltd 19158 9817 51.2 1 #> 2 company_type NA 19158 6140 32.0 2 #> 3 company_type Funded Startup 19158 1001 5.22 3 #> 4 company_type Public Sector 19158 955 4.98 4 #> 5 company_type Early Stage Startup 19158 603 3.15 5 #> 6 company_type NGO 19158 521 2.72 6 #> 7 company_type Other 19158 121 0.632 7 #> 8 job_chnge No 19158 14381 75.1 1 #> 9 job_chnge Yes 19158 4777 24.9 2# Negative values to drop variables, and In-memory mode con_sqlite %>% tbl("TB_JOBCHANGE") %>% diagnose_category(-company_type, -job_chnge, in_database = FALSE)#> # A tibble: 19,217 x 6 #> variables levels N freq ratio rank #> <chr> <chr> <int> <int> <dbl> <int> #> 1 enrollee_id 1 19158 1 0.00522 1 #> 2 enrollee_id 10 19158 1 0.00522 1 #> 3 enrollee_id 10000 19158 1 0.00522 1 #> 4 enrollee_id 10001 19158 1 0.00522 1 #> 5 enrollee_id 10002 19158 1 0.00522 1 #> 6 enrollee_id 10003 19158 1 0.00522 1 #> 7 enrollee_id 10004 19158 1 0.00522 1 #> 8 enrollee_id 10005 19158 1 0.00522 1 #> 9 enrollee_id 10006 19158 1 0.00522 1 #> 10 enrollee_id 10008 19158 1 0.00522 1 #> # … with 19,207 more rows# Positions values select variables, and In-memory mode and collect size is 200 con_sqlite %>% tbl("TB_JOBCHANGE") %>% diagnose_category(7, in_database = FALSE, collect_size = 200)#> # A tibble: 5 x 6 #> variables levels N freq ratio rank #> <chr> <chr> <int> <int> <dbl> <int> #> 1 education_level Graduate 200 126 63 1 #> 2 education_level Masters 200 41 20.5 2 #> 3 education_level High School 200 22 11 3 #> 4 education_level NA 200 6 3 4 #> 5 education_level Phd 200 5 2.5 5#> # A tibble: 19,220 x 6 #> variables levels N freq ratio rank #> <chr> <chr> <int> <int> <dbl> <int> #> 1 enrollee_id 9998 19158 1 0.00522 1 #> 2 enrollee_id 9995 19158 1 0.00522 1 #> 3 enrollee_id 9992 19158 1 0.00522 1 #> 4 enrollee_id 999 19158 1 0.00522 1 #> 5 enrollee_id 9989 19158 1 0.00522 1 #> 6 enrollee_id 9988 19158 1 0.00522 1 #> 7 enrollee_id 9987 19158 1 0.00522 1 #> 8 enrollee_id 9986 19158 1 0.00522 1 #> 9 enrollee_id 9985 19158 1 0.00522 1 #> 10 enrollee_id 9984 19158 1 0.00522 1 #> # … with 19,210 more rows# Top rank levels with top argument con_sqlite %>% tbl("TB_JOBCHANGE") %>% diagnose_category(top = 2)#> # A tibble: 19,180 x 6 #> variables levels N freq ratio rank #> <chr> <chr> <int> <int> <dbl> <int> #> 1 enrollee_id 9998 19158 1 0.00522 1 #> 2 enrollee_id 9995 19158 1 0.00522 1 #> 3 enrollee_id 9992 19158 1 0.00522 1 #> 4 enrollee_id 999 19158 1 0.00522 1 #> 5 enrollee_id 9989 19158 1 0.00522 1 #> 6 enrollee_id 9988 19158 1 0.00522 1 #> 7 enrollee_id 9987 19158 1 0.00522 1 #> 8 enrollee_id 9986 19158 1 0.00522 1 #> 9 enrollee_id 9985 19158 1 0.00522 1 #> 10 enrollee_id 9984 19158 1 0.00522 1 #> # … with 19,170 more rows# Using pipes & dplyr ------------------------- # Extraction of level that is more than 60% of categorical data con_sqlite %>% tbl("TB_JOBCHANGE") %>% diagnose_category() %>% filter(ratio >= 60)#> # A tibble: 6 x 6 #> variables levels N freq ratio rank #> <chr> <chr> <int> <int> <dbl> <int> #> 1 gender Male 19158 13221 69.0 1 #> 2 relevent_experience Has relevent experience 19158 13792 72.0 1 #> 3 enrolled_university no_enrollment 19158 13817 72.1 1 #> 4 education_level Graduate 19158 11598 60.5 1 #> 5 major_discipline STEM 19158 14492 75.6 1 #> 6 job_chnge No 19158 14381 75.1 1# Using type argument ------------------------- dfm <- data.frame(alpabet = c(rep(letters[1:5], times = 5), "c")) # copy dfm to the DBMS with a table named TB_EXAMPLE copy_to(con_sqlite, dfm, name = "TB_EXAMPLE", overwrite = TRUE) # extract rows that less than equal rank 10 # default of top argument is 10 con_sqlite %>% tbl("TB_EXAMPLE") %>% diagnose_category()#> # A tibble: 5 x 6 #> variables levels N freq ratio rank #> <chr> <chr> <int> <int> <dbl> <int> #> 1 alpabet c 26 6 23.1 1 #> 2 alpabet e 26 5 19.2 2 #> 3 alpabet d 26 5 19.2 2 #> 4 alpabet b 26 5 19.2 2 #> 5 alpabet a 26 5 19.2 2# extract rows that less than equal rank 2 con_sqlite %>% tbl("TB_EXAMPLE") %>% diagnose_category(top = 2, type = "rank")#> # A tibble: 5 x 6 #> variables levels N freq ratio rank #> <chr> <chr> <int> <int> <dbl> <int> #> 1 alpabet c 26 6 23.1 1 #> 2 alpabet e 26 5 19.2 2 #> 3 alpabet d 26 5 19.2 2 #> 4 alpabet b 26 5 19.2 2 #> 5 alpabet a 26 5 19.2 2# extract rows that less than equal rank 2 # default of type argument is "rank" con_sqlite %>% tbl("TB_EXAMPLE") %>% diagnose_category(top = 2)#> # A tibble: 5 x 6 #> variables levels N freq ratio rank #> <chr> <chr> <int> <int> <dbl> <int> #> 1 alpabet c 26 6 23.1 1 #> 2 alpabet e 26 5 19.2 2 #> 3 alpabet d 26 5 19.2 2 #> 4 alpabet b 26 5 19.2 2 #> 5 alpabet a 26 5 19.2 2#> # A tibble: 2 x 6 #> variables levels N freq ratio rank #> <chr> <chr> <int> <int> <dbl> <int> #> 1 alpabet c 26 6 23.1 1 #> 2 alpabet e 26 5 19.2 2