Cleansing the dataset for classification modeling

Diagnoses the similarity between the train set and the test set contained in an object of class "split_df", then cleanses that object.

# S3 method for split_df
cleanse(.data, add_character = FALSE, uniq_thres = 0.9, missing = FALSE, ...)

Arguments

.data: an object of class "split_df", usually the result of a call to split_by().
add_character: logical. Whether to include character (text) variables when comparing categorical data. The default is FALSE, which excludes character variables.
uniq_thres: numeric. Threshold for removing variables. A variable is removed when its ratio of unique values (number of unique values / number of observations) is greater than this value.
missing: logical. Whether to remove variables that contain missing values.
...: further arguments passed to or from other methods.
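
The quantities behind uniq_thres and missing can be illustrated with a small base R sketch. It is not the internal implementation of cleanse(); the data frame df and the helper vectors uniq_ratio and has_missing are hypothetical names invented for this illustration, and how the package itself treats NA when counting unique values is not shown here.

```r
# Hypothetical toy data, invented for this illustration only
df <- data.frame(
  id    = sprintf("ID%03d", 1:10),            # every value is distinct
  grade = rep(c("a", "b"), each = 5),         # only two distinct values
  score = c(1.5, NA, 2.5, 3.0, 1.5, 2.5, 3.0, 1.5, 2.5, 3.0),
  stringsAsFactors = TRUE
)

# Ratio of unique values: number of unique values / number of observations
uniq_ratio <- vapply(df, function(x) length(unique(x)) / length(x), numeric(1))

# Variables whose ratio exceeds a threshold of 0.9 (the documented default)
over_thres <- names(uniq_ratio)[uniq_ratio > 0.9]

# Variables that contain at least one missing value
has_missing <- vapply(df, anyNA, logical(1))
with_na <- names(has_missing)[has_missing]

over_thres
with_na
```

In this toy data only id exceeds the 0.9 ratio, and only score contains a missing value, so these are the kinds of variables the two arguments are concerned with.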

Value

An object of class "split_df".

Details

Removes the variables flagged by the diagnosis that the compare_diag() function performs.

Examples

library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following object is masked from ‘package:randomForest’:
#> 
#>     combine
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union

# Credit Card Default Data
head(ISLR::Default)
#>   default student   balance    income
#> 1      No      No  729.5265 44361.625
#> 2      No     Yes  817.1804 12106.135
#> 3      No      No 1073.5492 31767.139
#> 4      No      No  529.2506 35704.494
#> 5      No      No  785.6559 38463.496
#> 6      No     Yes  919.5885  7491.559

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)

sb %>%
  cleanse()
#> There were no diagnostics issues
#> # A tibble: 10,000 × 5
#> # Groups:   split_flag [2]
#>    default student balance income split_flag
#>    <fct>   <fct>     <dbl>  <dbl> <chr>     
#>  1 No      No         730. 44362. train     
#>  2 No      Yes        817. 12106. train     
#>  3 No      No        1074. 31767. train     
#>  4 No      No         529. 35704. train     
#>  5 No      No         786. 38463. train     
#>  6 No      Yes        920.  7492. train     
#>  7 No      No         826. 24905. test      
#>  8 No      Yes        809. 17600. test      
#>  9 No      No        1161. 37469. train     
#> 10 No      No           0  29275. train     
#> # ℹ 9,990 more rows