Diagnoses the similarity between the train set and the test set contained in a "split_df" object, and cleanses the "split_df" object accordingly.
# S3 method for split_df
cleanse(.data, add_character = FALSE, uniq_thres = 0.9, missing = FALSE, ...)
.data: an object of class "split_df", usually the result of a call to split_df().
add_character: logical. Whether to include character (text) variables when comparing categorical data. The default is FALSE, which excludes character variables.
uniq_thres: numeric. Threshold for removing variables. A variable is removed when its ratio of unique values (number of unique values / number of observations) is greater than this value.
missing: logical. Whether to remove variables that contain missing values.
...: further arguments passed to or from other methods.
An object of class "split_df".
Removes the variables that the diagnosis performed by the compare_diag() function flags as problematic.
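The removal criteria described by the arguments above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `flag_columns()` and the `toy` data frame are hypothetical, the unique-value ratio is assumed to be checked only for categorical columns, and the missing-value check is assumed to apply to every column.

```r
# Hypothetical helper for illustration only; not part of the package.
flag_columns <- function(df, add_character = FALSE,
                         uniq_thres = 0.9, missing = FALSE) {
  n <- nrow(df)
  # Assumption: only categorical columns are checked for the unique-value ratio.
  is_cat <- vapply(df, is.factor, logical(1))
  if (add_character) {
    is_cat <- is_cat | vapply(df, is.character, logical(1))
  }
  # Ratio of unique values = number of unique values / number of observations.
  uniq_ratio <- vapply(df, function(x) length(unique(x)) / n, numeric(1))
  drop <- is_cat & uniq_ratio > uniq_thres
  # Assumption: the missing-value check applies to every column.
  if (missing) {
    drop <- drop | vapply(df, anyNA, logical(1))
  }
  names(df)[drop]
}

toy <- data.frame(
  id    = factor(sprintf("id%02d", 1:10)),     # 10 unique / 10 rows = 1.0
  group = factor(rep(c("a", "b"), each = 5)),  # 2 unique / 10 rows = 0.2
  note  = sprintf("note %d", 1:10),            # character, ratio 1.0
  score = c(NA, 1, 1, 2, 2, 3, 3, 4, 4, 5),    # numeric with one NA
  stringsAsFactors = FALSE
)

flag_columns(toy)                        # "id"
flag_columns(toy, add_character = TRUE)  # "id" "note"
flag_columns(toy, missing = TRUE)        # "id" "score"
```

In this sketch, `group` is never flagged because its ratio of 0.2 is below the default threshold of 0.9, and `note` is only considered once `add_character = TRUE`.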
library(alookr)
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following object is masked from ‘package:randomForest’:
#>
#> combine
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
# Credit Card Default Data
head(ISLR::Default)
#>   default student   balance    income
#> 1      No      No  729.5265 44361.625
#> 2      No     Yes  817.1804 12106.135
#> 3      No      No 1073.5492 31767.139
#> 4      No      No  529.2506 35704.494
#> 5      No      No  785.6559 38463.496
#> 6      No     Yes  919.5885  7491.559
# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)
sb %>%
  cleanse()
#> There were no diagnostics issues
#> # A tibble: 10,000 × 5
#> # Groups: split_flag [2]
#>    default student balance income split_flag
#>    <fct>   <fct>     <dbl>  <dbl> <chr>
#>  1 No      No         730. 44362. train
#>  2 No      Yes        817. 12106. train
#>  3 No      No        1074. 31767. train
#>  4 No      No         529. 35704. train
#>  5 No      No         786. 38463. train
#>  6 No      Yes        920.  7492. train
#>  7 No      No         826. 24905. test
#>  8 No      Yes        809. 17600. test
#>  9 No      No        1161. 37469. train
#> 10 No      No           0  29275. train
#> # ℹ 9,990 more rows
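The call above uses the defaults. A possible variation with non-default arguments is sketched below. It assumes that the alookr and ISLR packages are installed and that compare_diag() accepts the "split_df" object as its first argument. The threshold of 0.5 is an arbitrary illustrative value. Output is omitted because it depends on the random split.

```r
library(alookr)
library(dplyr)

# Same split as in the example above
sb <- ISLR::Default %>%
  split_by(default)

# Inspect the diagnosis on its own before removing anything
compare_diag(sb)

# Stricter cleansing: lower the unique-value threshold (0.5 is illustrative)
# and also remove variables that contain missing values
sb_clean <- sb %>%
  cleanse(uniq_thres = 0.5, missing = TRUE)

# According to the Value section, the result is still a "split_df" object
class(sb_clean)
```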