Diagnose the similarity between the train set and the test set contained in an object of class "split_df".

compare_diag(
  .data,
  add_character = FALSE,
  uniq_thres = 0.01,
  miss_msg = TRUE,
  verbose = TRUE
)

Arguments

.data

an object of class "split_df", usually the result of a call to split_by().

add_character

logical. Whether to include character (text) variables in the comparison of categorical data. The default is FALSE, which excludes character variables.

uniq_thres

numeric. Threshold for flagging variables for removal: a variable is flagged when its ratio of unique values (number of unique values / number of observations) is greater than this value.

miss_msg

logical. Whether to output a message when missing values are diagnosed.

verbose

logical. Set whether to echo information to the console at runtime.

Value

list of tbl_df components. Variables of the tbl_df in the component named "single_value":

  • variables : character. variable name

  • train_uniq : character. the type of unique value in the train set, either "single" or "multi".

  • test_uniq : character. the type of unique value in the test set, either "single" or "multi".
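
The "single"/"multi" labels above can be illustrated with a minimal base R sketch. It is not the function's actual implementation: the toy data, the helper name uniq_type, and the choice to ignore NA when counting distinct values are all assumptions made for illustration.

```r
# Hypothetical toy train/test sets (not from the package).
train <- data.frame(flag = c("a", "a", "a"), grp = c("x", "y", "x"))
test  <- data.frame(flag = c("a", "b", "a"), grp = c("x", "x", "x"))

# Hypothetical helper: "single" if a column has exactly one distinct
# non-missing value, otherwise "multi" (NA handling is an assumption).
uniq_type <- function(x) {
  if (length(unique(x[!is.na(x)])) == 1) "single" else "multi"
}

single_value <- data.frame(
  variables  = names(train),
  train_uniq = unname(vapply(train, uniq_type, character(1))),
  test_uniq  = unname(vapply(test, uniq_type, character(1)))
)

# Keep only variables that are single-valued in at least one of the sets.
keep <- single_value$train_uniq == "single" | single_value$test_uniq == "single"
single_value <- single_value[keep, ]
print(single_value)
```

In this toy case both variables are kept: flag is constant only in the train set and grp is constant only in the test set.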

Variables of the tbl_df in the component named "uniq_rate":

  • variables : character. categorical variable name

  • train_uniqcount : numeric. the number of unique values in the train set

  • train_uniqrate : numeric. the ratio of unique values (number of unique values / number of observations) in the train set

  • test_uniqcount : numeric. the number of unique values in the test set

  • test_uniqrate : numeric. the ratio of unique values (number of unique values / number of observations) in the test set
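
The unique-value ratio defined above is simple to reproduce by hand. The following base R sketch is only an illustration under assumed inputs: the toy data and the threshold value 0.6 are hypothetical, and the code is not taken from the package.

```r
# Hypothetical toy data: 'id' is unique per row, 'grp' has two values.
train <- data.frame(
  id  = c("a1", "a2", "a3", "a4"),
  grp = c("x", "y", "x", "y")
)

uniq_thres <- 0.6  # illustrative threshold, not the function default

# Ratio = number of unique values / number of observations.
uniq_count <- vapply(train, function(x) length(unique(x)), integer(1))
uniq_rate  <- uniq_count / nrow(train)

# Variables whose ratio is greater than the threshold are flagged.
flagged <- names(uniq_rate)[uniq_rate > uniq_thres]
print(uniq_rate)
print(flagged)
```

Here id has a ratio of 4 / 4 = 1 and is flagged, while grp has a ratio of 2 / 4 = 0.5 and is not. The same arithmetic explains the Examples output: 2 unique values at a rate of 0.000667 is consistent with about 3,000 test observations, which exceeds a threshold of 0.0005 but not the default of 0.01.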

Variables of the tbl_df in the component named "missing_level":

  • variables : character. variable name

  • n_levels : integer. the number of levels of the categorical variable

  • train_missing_nlevel : integer. the number of levels that do not occur in the train set

  • test_missing_nlevel : integer. the number of levels that do not occur in the test set
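
The idea of a level that does not occur in one of the sets can be sketched in base R as follows. The toy data and the helper name n_missing_levels are hypothetical, and the sketch does not reproduce how compare_diag() treats missing values.

```r
# Hypothetical toy data: the factor has three levels overall.
lv    <- c("low", "mid", "high")
train <- data.frame(size = factor(c("low", "mid", "high", "low"), levels = lv))
test  <- data.frame(size = factor(c("low", "low", "mid"), levels = lv))

# Hypothetical helper: count the levels of x that never occur in x.
# table() on a factor keeps zero-count levels, so they can be counted.
n_missing_levels <- function(x) sum(table(x) == 0)

missing_level <- data.frame(
  variables            = "size",
  n_levels             = nlevels(train$size),
  train_missing_nlevel = n_missing_levels(train$size),
  test_missing_nlevel  = n_missing_levels(test$size)
)
print(missing_level)
```

In this toy case the level "high" never appears in the test set, so test_missing_nlevel is 1 while train_missing_nlevel is 0.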

Details

The diagnosis covers the two split datasets (train and test). It reports variables that take only a single value, variables with a level that does not occur in one of the datasets, and variables with a high ratio of unique values to the number of observations.

Examples

library(dplyr)
library(dlookr)  # provides split_by() and compare_diag()

# Credit Card Default Data
head(ISLR::Default)
#>   default student   balance    income
#> 1      No      No  729.5265 44361.625
#> 2      No     Yes  817.1804 12106.135
#> 3      No      No 1073.5492 31767.139
#> 4      No      No  529.2506 35704.494
#> 5      No      No  785.6559 38463.496
#> 6      No     Yes  919.5885  7491.559

defaults <- ISLR::Default
defaults$id <- seq(NROW(defaults))

set.seed(1)
defaults[sample(seq(NROW(defaults)), 3), "student"] <- NA
set.seed(2)
defaults[sample(seq(NROW(defaults)), 10), "balance"] <- NA

sb <- defaults %>%
  split_by(default)

sb %>%
  compare_diag()
#> * Detected diagnose missing value
#>  - student
#>  - balance
#>  - balance
#> 
#> * Detected diagnose missing levels
#>  - student
#> $missing_value
#> # A tibble: 3 × 4
#>   variables train_misscount train_missrate test_missrate
#>   <chr>               <int>          <dbl>         <dbl>
#> 1 student                 3         0.0429       NA     
#> 2 balance                 8         0.114        NA     
#> 3 balance                 2        NA             0.0667
#> 
#> $single_value
#> # A tibble: 0 × 3
#> # ℹ 3 variables: variables <chr>, train_uniq <lgl>, test_uniq <lgl>
#> 
#> $uniq_rate
#> # A tibble: 0 × 5
#> # ℹ 5 variables: variables <chr>, train_uniqcount <int>, train_uniqrate <dbl>,
#> #   test_uniqcount <int>, test_uniqrate <dbl>
#> 
#> $missing_level
#> # A tibble: 1 × 4
#>   variables n_levels train_missing_nlevel test_missing_nlevel
#>   <chr>        <int>                <int>               <int>
#> 1 student          3                    0                   1
#> 

sb %>%
  compare_diag(add_character = TRUE)
#> * Detected diagnose missing value
#>  - student
#>  - balance
#>  - balance
#> 
#> * Detected diagnose missing levels
#>  - student
#> $missing_value
#> # A tibble: 3 × 4
#>   variables train_misscount train_missrate test_missrate
#>   <chr>               <int>          <dbl>         <dbl>
#> 1 student                 3         0.0429       NA     
#> 2 balance                 8         0.114        NA     
#> 3 balance                 2        NA             0.0667
#> 
#> $single_value
#> # A tibble: 0 × 3
#> # ℹ 3 variables: variables <chr>, train_uniq <lgl>, test_uniq <lgl>
#> 
#> $uniq_rate
#> # A tibble: 0 × 5
#> # ℹ 5 variables: variables <chr>, train_uniqcount <int>, train_uniqrate <dbl>,
#> #   test_uniqcount <int>, test_uniqrate <dbl>
#> 
#> $missing_level
#> # A tibble: 1 × 4
#>   variables n_levels train_missing_nlevel test_missing_nlevel
#>   <chr>        <int>                <int>               <int>
#> 1 student          3                    0                   1
#> 

sb %>%
  compare_diag(uniq_thres = 0.0005)
#> * Detected diagnose missing value
#>  - student
#>  - balance
#>  - balance
#> 
#> * Detected diagnose many unique value
#>  - default
#>  - student
#> 
#> * Detected diagnose missing levels
#>  - student
#> $missing_value
#> # A tibble: 3 × 4
#>   variables train_misscount train_missrate test_missrate
#>   <chr>               <int>          <dbl>         <dbl>
#> 1 student                 3         0.0429       NA     
#> 2 balance                 8         0.114        NA     
#> 3 balance                 2        NA             0.0667
#> 
#> $single_value
#> # A tibble: 0 × 3
#> # ℹ 3 variables: variables <chr>, train_uniq <lgl>, test_uniq <lgl>
#> 
#> $uniq_rate
#> # A tibble: 2 × 5
#>   variables train_uniqcount train_uniqrate test_uniqcount test_uniqrate
#>   <chr>               <int>          <dbl>          <int>         <dbl>
#> 1 default                NA             NA              2      0.000667
#> 2 student                NA             NA              2      0.000667
#> 
#> $missing_level
#> # A tibble: 1 × 4
#>   variables n_levels train_missing_nlevel test_missing_nlevel
#>   <chr>        <int>                <int>               <int>
#> 1 student          3                    0                   1
#>