Diagnoses the similarity between the train set and the test set contained in an object of class "split_df".
compare_diag(
.data,
add_character = FALSE,
uniq_thres = 0.01,
miss_msg = TRUE,
verbose = TRUE
)
an object of class "split_df", usually the result of a call to split_by().
logical. Whether to include character (text) variables when comparing categorical data. The default is FALSE, which excludes character variables.
numeric. Threshold used to flag variables for removal when the ratio of unique values (number of unique values / number of observations) is greater than this value.
logical. Whether to output a message when diagnosing missing values.
logical. Set whether to echo information to the console at runtime.
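As a rough illustration of the ratio that uniq_thres is compared against, the following base R sketch computes the number of unique values divided by the number of observations for a single variable. The helper uniq_rate_of() is hypothetical and not part of the package; the package's internal computation may differ, for example in how missing values are counted.

```r
# Hypothetical helper (not a package function): ratio of unique values
# to the number of observations, as described for uniq_thres.
uniq_rate_of <- function(x) {
  length(unique(x)) / length(x)
}

# A categorical variable with 2 distinct values in 8 observations
student <- factor(c("No", "Yes", "No", "No", "Yes", "No", "No", "No"))
rate <- uniq_rate_of(student)
rate
#> [1] 0.25

# Compared with the default threshold of 0.01, this ratio is larger
rate > 0.01
#> [1] TRUE
```

On small toy data almost any variable exceeds the threshold; on real data with many rows, categorical variables with few levels fall well below it.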
list. Variables of tbl_df for first component named "single_value":
variables : character. variable name
train_uniq : character. the type of unique value in the train set, either "single" or "multi".
test_uniq : character. the type of unique value in the test set, either "single" or "multi".
Variables of tbl_df for second component named "uniq_rate":
variables : character. categorical variable name
train_uniqcount : numeric. the number of unique values in the train set
train_uniqrate : numeric. the ratio of unique values (number of unique values / number of observations) in the train set
test_uniqcount : numeric. the number of unique values in the test set
test_uniqrate : numeric. the ratio of unique values (number of unique values / number of observations) in the test set
Variables of tbl_df for third component named "missing_level":
variables : character. variable name
n_levels : integer. the number of levels of the categorical variable
train_missing_nlevel : integer. the number of levels that do not exist in the train set
test_missing_nlevel : integer. the number of levels that do not exist in the test set
For the two split datasets, the function diagnoses variables that have a single value, variables with a level that does not appear in one of the datasets, and variables with a high ratio of unique values to the number of observations.
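The missing-level and single-value checks described here can be sketched in base R on toy data. This is only an illustration of the idea, not the package's implementation; the helper n_missing_levels() and the toy vectors train and test are hypothetical.

```r
# Toy train and test vectors sharing the same declared factor levels
lv <- c("No", "Yes", "Unknown")
train <- factor(c("No", "Yes", "Unknown", "No"), levels = lv)
test  <- factor(c("No", "No", "Yes"), levels = lv)

# Hypothetical helper (not a package function): number of declared
# levels that never occur among the observed values of x.
n_missing_levels <- function(x) {
  length(setdiff(levels(x), unique(as.character(x))))
}

n_missing_levels(train)
#> [1] 0
n_missing_levels(test)
#> [1] 1

# A variable holds a single value when it has only one distinct value
length(unique(test)) == 1
#> [1] FALSE
```

A level that is present in the train set but absent from the test set (or the reverse) is the kind of mismatch reported in the "missing_level" component.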
library(dplyr)
# split_by() and compare_diag() are provided by the dlookr package
library(dlookr)
# Credit Card Default Data
head(ISLR::Default)
#> default student balance income
#> 1 No No 729.5265 44361.625
#> 2 No Yes 817.1804 12106.135
#> 3 No No 1073.5492 31767.139
#> 4 No No 529.2506 35704.494
#> 5 No No 785.6559 38463.496
#> 6 No Yes 919.5885 7491.559
defaults <- ISLR::Default
defaults$id <- seq(NROW(defaults))
set.seed(1)
defaults[sample(seq(NROW(defaults)), 3), "student"] <- NA
set.seed(2)
defaults[sample(seq(NROW(defaults)), 10), "balance"] <- NA
sb <- defaults %>%
split_by(default)
sb %>%
compare_diag()
#> * Detected diagnose missing value
#> - student
#> - balance
#> - balance
#>
#> * Detected diagnose missing levels
#> - student
#> $missing_value
#> # A tibble: 3 × 4
#> variables train_misscount train_missrate test_missrate
#> <chr> <int> <dbl> <dbl>
#> 1 student 3 0.0429 NA
#> 2 balance 8 0.114 NA
#> 3 balance 2 NA 0.0667
#>
#> $single_value
#> # A tibble: 0 × 3
#> # ℹ 3 variables: variables <chr>, train_uniq <lgl>, test_uniq <lgl>
#>
#> $uniq_rate
#> # A tibble: 0 × 5
#> # ℹ 5 variables: variables <chr>, train_uniqcount <int>, train_uniqrate <dbl>,
#> # test_uniqcount <int>, test_uniqrate <dbl>
#>
#> $missing_level
#> # A tibble: 1 × 4
#> variables n_levels train_missing_nlevel test_missing_nlevel
#> <chr> <int> <int> <int>
#> 1 student 3 0 1
#>
sb %>%
compare_diag(add_character = TRUE)
#> * Detected diagnose missing value
#> - student
#> - balance
#> - balance
#>
#> * Detected diagnose missing levels
#> - student
#> $missing_value
#> # A tibble: 3 × 4
#> variables train_misscount train_missrate test_missrate
#> <chr> <int> <dbl> <dbl>
#> 1 student 3 0.0429 NA
#> 2 balance 8 0.114 NA
#> 3 balance 2 NA 0.0667
#>
#> $single_value
#> # A tibble: 0 × 3
#> # ℹ 3 variables: variables <chr>, train_uniq <lgl>, test_uniq <lgl>
#>
#> $uniq_rate
#> # A tibble: 0 × 5
#> # ℹ 5 variables: variables <chr>, train_uniqcount <int>, train_uniqrate <dbl>,
#> # test_uniqcount <int>, test_uniqrate <dbl>
#>
#> $missing_level
#> # A tibble: 1 × 4
#> variables n_levels train_missing_nlevel test_missing_nlevel
#> <chr> <int> <int> <int>
#> 1 student 3 0 1
#>
sb %>%
compare_diag(uniq_thres = 0.0005)
#> * Detected diagnose missing value
#> - student
#> - balance
#> - balance
#>
#> * Detected diagnose many unique value
#> - default
#> - student
#>
#> * Detected diagnose missing levels
#> - student
#> $missing_value
#> # A tibble: 3 × 4
#> variables train_misscount train_missrate test_missrate
#> <chr> <int> <dbl> <dbl>
#> 1 student 3 0.0429 NA
#> 2 balance 8 0.114 NA
#> 3 balance 2 NA 0.0667
#>
#> $single_value
#> # A tibble: 0 × 3
#> # ℹ 3 variables: variables <chr>, train_uniq <lgl>, test_uniq <lgl>
#>
#> $uniq_rate
#> # A tibble: 2 × 5
#> variables train_uniqcount train_uniqrate test_uniqcount test_uniqrate
#> <chr> <int> <dbl> <int> <dbl>
#> 1 default NA NA 2 0.000667
#> 2 student NA NA 2 0.000667
#>
#> $missing_level
#> # A tibble: 1 × 4
#> variables n_levels train_missing_nlevel test_missing_nlevel
#> <chr> <int> <int> <int>
#> 1 student 3 0 1
#>