Comparison of categorical variables of train set and test set

Compares the statistics of the categorical variables in the train set and the test set of a "split_df" object.

compare_target_category(.data, ..., add_character = FALSE, margin = FALSE)

Arguments

.data: an object of class "split_df", usually the result of a call to split_by().
...: one or more unquoted expressions separated by commas, selecting the categorical variables to compare. You can treat variable names as if they were positions. Positive values select variables; negative values drop variables. If the first expression is negative, compare_target_category() automatically starts with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.
add_character: logical. Whether to include character (text) variables in the comparison of categorical data. The default is FALSE, which excludes character variables.
margin: logical. Whether to add the marginal frequency information, shown as a <Total> row for each variable. The default is FALSE.

Value

A tbl_df with the following variables:

variable : character. categorical variable name
level : factor. level of categorical variables
train : numeric. the relative frequency (in percent) of the level in the train set
test : numeric. the relative frequency (in percent) of the level in the test set
abs_diff : numeric. the absolute value of the difference between the two relative frequencies
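
The train, test, and abs_diff columns can be illustrated with a small base R sketch. This is not the function's actual implementation; the toy data and the helper rel_freq() are hypothetical and only show how a per-level percentage and its absolute difference might be computed.

```r
# Hypothetical toy data standing in for the train and test parts of a split
train <- data.frame(student = factor(c("No", "No", "Yes", "No", "Yes", "No")))
test  <- data.frame(student = factor(c("No", "Yes", "No", "Yes"),
                                     levels = levels(train$student)))

# Hypothetical helper: percentage of observations at each factor level
rel_freq <- function(x) as.numeric(prop.table(table(x)) * 100)

comparison <- data.frame(
  variable = "student",
  level    = factor(levels(train$student)),
  train    = rel_freq(train$student),
  test     = rel_freq(test$student)
)

# Absolute difference between the two percentages, per level
comparison$abs_diff <- abs(comparison$train - comparison$test)
comparison
```

Giving the test factor the same levels as the train factor keeps the rows aligned even when a level is missing from one of the two sets.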

Details

Compare the statistics of the categorical variables of the train set and the test set to determine whether the raw data is well separated into the two data sets. Small abs_diff values indicate that each level occurs at a similar rate in both sets.
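
One way to act on the comparison is to flag the levels whose train and test frequencies differ by more than some tolerance. The sketch below rebuilds the result from the values printed in the Examples section and filters it with dplyr; the tolerance of 0.5 percentage points is an arbitrary assumption, not a recommendation of the package.

```r
library(dplyr)

# Values copied from the example output of compare_target_category()
cmp <- data.frame(
  variable = c("default", "default", "student", "student"),
  level    = c("No", "Yes", "No", "Yes"),
  train    = c(96.7, 3.33, 70.3, 29.7),
  test     = c(96.7, 3.33, 71.1, 28.9),
  abs_diff = c(0.00476, 0.00476, 0.724, 0.724)
)

# Hypothetical tolerance in percentage points; pick a value that suits your data
tolerance <- 0.5

# Keep only the levels whose difference exceeds the tolerance, largest first
flagged <- cmp %>%
  filter(abs_diff > tolerance) %>%
  arrange(desc(abs_diff))
flagged
```

With this tolerance only the two student levels are kept; whether a gap of that size matters depends on the sample size and on how the variable is used downstream.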

Examples

library(dlookr)
library(dplyr)

# Credit Card Default Data
head(ISLR::Default)
#>   default student   balance    income
#> 1      No      No  729.5265 44361.625
#> 2      No     Yes  817.1804 12106.135
#> 3      No      No 1073.5492 31767.139
#> 4      No      No  529.2506 35704.494
#> 5      No      No  785.6559 38463.496
#> 6      No     Yes  919.5885  7491.559

# Generate data for the example: split into train and test sets by the target variable
sb <- ISLR::Default %>%
  split_by(default)

# Compare all categorical variables
sb %>%
  compare_target_category()
#> # A tibble: 4 × 5
#>   variable level train  test abs_diff
#>   <chr>    <fct> <dbl> <dbl>    <dbl>
#> 1 default  No    96.7  96.7   0.00476
#> 2 default  Yes    3.33  3.33  0.00476
#> 3 student  No    70.3  71.1   0.724  
#> 4 student  Yes   29.7  28.9   0.724  

# Include character variables as well; Default has none, so the result is unchanged
sb %>%
  compare_target_category(add_character = TRUE)
#> # A tibble: 4 × 5
#>   variable level train  test abs_diff
#>   <chr>    <fct> <dbl> <dbl>    <dbl>
#> 1 default  No    96.7  96.7   0.00476
#> 2 default  Yes    3.33  3.33  0.00476
#> 3 student  No    70.3  71.1   0.724  
#> 4 student  Yes   29.7  28.9   0.724  

# Add a <Total> row with the marginal frequencies for each variable
sb %>%
  compare_target_category(margin = TRUE)
#> # A tibble: 6 × 5
#>   variable level    train   test abs_diff
#>   <chr>    <fct>    <dbl>  <dbl>    <dbl>
#> 1 default  No       96.7   96.7   0.00476
#> 2 default  Yes       3.33   3.33  0.00476
#> 3 default  <Total> 100    100     0.00952
#> 4 student  No       70.3   71.1   0.724  
#> 5 student  Yes      29.7   28.9   0.724  
#> 6 student  <Total> 100    100     1.45   

# Compare only the selected variable
sb %>%
  compare_target_category(student)
#> # A tibble: 2 × 5
#>   variable level train  test abs_diff
#>   <chr>    <fct> <dbl> <dbl>    <dbl>
#> 1 student  No     70.3  71.1    0.724
#> 2 student  Yes    29.7  28.9    0.724

# Compare only the selected variable, with its <Total> row
sb %>%
  compare_target_category(student, margin = TRUE)
#> # A tibble: 3 × 5
#>   variable level   train  test abs_diff
#>   <chr>    <fct>   <dbl> <dbl>    <dbl>
#> 1 student  No       70.3  71.1    0.724
#> 2 student  Yes      29.7  28.9    0.724
#> 3 student  <Total> 100   100      1.45