Diagnoses the similarity between the train set and the test set contained in a "split_df" object, and cleanses the "split_df" object.

# S3 method for split_df
cleanse(.data, add_character = FALSE, uniq_thres = 0.9, missing = FALSE, ...)

Arguments

.data

an object of class "split_df", usually the result of a call to split_by().

add_character

logical. Whether to include character (text) variables when comparing categorical data. The default is FALSE, which excludes character variables.

uniq_thres

numeric. Threshold for removing variables: a variable is removed when its ratio of unique values (number of unique values / number of observations) is greater than this value. The default is 0.9.

missing

logical. Whether to remove variables that contain missing values. The default is FALSE.

...

further arguments passed to or from other methods.

Value

An object of class "split_df".

Details

Removes the variables flagged by the diagnosis performed with the compare_diag() function.
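
The removal rules described for uniq_thres, missing, and add_character can be illustrated with a small base R sketch. This is not the package's implementation: flag_columns() and the toy data frame are hypothetical, and the sketch assumes that the unique-value ratio is checked only on categorical columns (factors, plus character columns when add_character = TRUE). The full set of checks that compare_diag() performs is not reproduced here.

```r
# Hypothetical helper, not part of the package: returns the names of the
# columns that the documented rules would flag for removal.
flag_columns <- function(df, add_character = FALSE, uniq_thres = 0.9,
                         missing = FALSE) {
  # Assumption: only categorical columns are subject to the unique-ratio rule
  is_cat <- vapply(
    df,
    function(x) is.factor(x) || (add_character && is.character(x)),
    logical(1)
  )
  # Ratio of unique values: number of unique values / number of observations
  uniq_ratio <- vapply(df, function(x) length(unique(x)) / length(x), numeric(1))
  drop <- is_cat & uniq_ratio > uniq_thres
  # Optionally flag every column that contains at least one missing value
  if (missing) {
    drop <- drop | vapply(df, anyNA, logical(1))
  }
  names(df)[drop]
}

toy <- data.frame(
  id    = factor(sprintf("id%02d", 1:10)),     # factor, ratio 10 / 10 = 1.0
  code  = sprintf("c%02d", 1:10),              # character, ratio 1.0
  group = factor(rep(c("a", "b"), each = 5)),  # factor, ratio 2 / 10 = 0.2
  score = c(NA, 2, 3, 3, 5, 5, 5, 8, 9, 9),    # numeric, one missing value
  stringsAsFactors = FALSE
)

flag_columns(toy)                        # "id"
flag_columns(toy, add_character = TRUE)  # "id" "code"
flag_columns(toy, missing = TRUE)        # "id" "score"
```

With the defaults, only the identifier-like factor is flagged. Character columns are considered only when add_character = TRUE, and columns with missing values only when missing = TRUE.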

Examples

library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following object is masked from ‘package:randomForest’:
#> 
#>     combine
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union

# Credit Card Default Data
head(ISLR::Default)
#>   default student   balance    income
#> 1      No      No  729.5265 44361.625
#> 2      No     Yes  817.1804 12106.135
#> 3      No      No 1073.5492 31767.139
#> 4      No      No  529.2506 35704.494
#> 5      No      No  785.6559 38463.496
#> 6      No     Yes  919.5885  7491.559

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)

sb %>%
  cleanse()
#> There were no diagnostics issues
#> # A tibble: 10,000 × 5
#> # Groups:   split_flag [2]
#>    default student balance income split_flag
#>    <fct>   <fct>     <dbl>  <dbl> <chr>     
#>  1 No      No         730. 44362. train     
#>  2 No      Yes        817. 12106. train     
#>  3 No      No        1074. 31767. train     
#>  4 No      No         529. 35704. train     
#>  5 No      No         786. 38463. train     
#>  6 No      Yes        920.  7492. train     
#>  7 No      No         826. 24905. test      
#>  8 No      Yes        809. 17600. test      
#>  9 No      No        1161. 37469. train     
#> 10 No      No           0  29275. train     
#> # ℹ 9,990 more rows