To address class imbalance, perform sampling on the train set of a split_df object.
an object of class "split_df", usually the result of a call to split_by().
character. The sampling method: "ubUnder" for under-sampling, "ubOver" for over-sampling, or "ubSMOTE" for SMOTE (Synthetic Minority Over-sampling TEchnique).
integer. random seed used for sampling
integer. The percentage of the positive class in the final dataset. It is used only in under-sampling. The default is 50, and perc cannot exceed 50.
integer. It is used only in over-sampling and SMOTE. In over-sampling, if k = 0, sample with replacement from the minority class until both classes have the same number of instances; if k > 0, sample with replacement from the minority class until there are k times the original number of minority instances. In SMOTE, k is the number of neighbours considered as the pool from which the new examples are generated.
integer. It is used only in SMOTE. perc.over/100 is the number of new instances generated for each rare instance. If perc.over < 100, a single instance is generated.
integer. It is used only in SMOTE. perc.under/100 is the number of "normal" (majority class) instances that are randomly selected for each smoted observation.
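The class sizes implied by k, perc.over and perc.under can be worked out by hand. The following is a minimal sketch in base R that only does this arithmetic; it is not the package's implementation. The helper names over_counts() and smote_counts() are hypothetical, the defaults of 200 for perc.over and perc.under are assumed because they reproduce the counts in the examples on this page, and the SMOTE helper assumes perc.over is a multiple of 100. The inputs 6767 and 233 are the majority and minority counts that appear in those examples.

```r
# Hypothetical helpers (not part of the package): arithmetic only.

# Over-sampling: k = 0 balances the classes,
# k > 0 yields k times the original minority count.
over_counts <- function(n_major, n_minor, k = 0) {
  stopifnot(k >= 0)
  minority <- if (k == 0) n_major else k * n_minor
  c(majority = n_major, minority = minority)
}

# SMOTE: perc.over / 100 synthetic cases per minority case, and
# perc.under / 100 majority cases kept per synthetic case.
# Assumption of this sketch: perc.over is a multiple of 100.
smote_counts <- function(n_minor, perc.over = 200, perc.under = 200) {
  stopifnot(perc.over >= 100, perc.over %% 100 == 0)
  synthetic <- n_minor * perc.over / 100
  c(majority = synthetic * perc.under / 100,
    minority = n_minor + synthetic)
}

over_counts(6767, 233)               # majority 6767, minority 6767
over_counts(6767, 233, k = 10)       # majority 6767, minority 2330
smote_counts(233)                    # majority 932,  minority 699
smote_counts(233, perc.under = 250)  # majority 1165, minority 699
```

With SMOTE the majority count depends on the number of synthetic cases rather than on the original majority count, which is why the SMOTE results are much smaller than the over-sampling results.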
An object of class "train_df".
To address class imbalance, sampling is performed by under-sampling, over-sampling, or SMOTE.
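As a rough illustration of the two random resampling strategies, the sketch below uses only base R on made-up toy data. It is not how this function is implemented internally; the data frame toy and its columns are hypothetical. SMOTE is left out because it synthesizes new minority cases from nearest neighbours rather than resampling existing rows.

```r
# Toy data (hypothetical): 90 majority ("No") and 10 minority ("Yes") rows.
set.seed(1234)
toy <- data.frame(
  x = rnorm(100),
  y = factor(rep(c("No", "Yes"), times = c(90, 10)))
)

minor_idx <- which(toy$y == "Yes")
major_idx <- which(toy$y == "No")

# Under-sampling: keep every minority row and draw the same number of
# majority rows without replacement.
under <- toy[c(minor_idx, sample(major_idx, length(minor_idx))), ]

# Over-sampling: keep every majority row and draw minority rows with
# replacement until both classes have the same size.
over <- toy[c(major_idx,
              sample(minor_idx, length(major_idx), replace = TRUE)), ]

table(under$y)  # 10 rows per class
table(over$y)   # 90 rows per class
```

Under-sampling discards majority rows, while over-sampling duplicates minority rows, so the two trade information loss against repeated observations.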
The attributes of the train_df class are as follows:
sample_seed : integer. random seed used for sampling
method : character. sampling methods.
perc : integer. perc argument value
k : integer. k argument value
perc.over : integer. perc.over argument value
perc.under : integer. perc.under argument value
binary : logical. whether the target variable is a binary class
target : character. target variable name
minority : character. the level of the minority class
majority : character. the level of the majority class
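Attributes such as these can be read with base R's attr(). The sketch below builds a mock stand-in object with hypothetical values so that it runs without the package; a real object would come from sampling_target(), and the attribute names are taken from the list above.

```r
# Mock stand-in for a returned object (hypothetical values only).
train <- structure(
  data.frame(default = factor(c("No", "Yes")), balance = c(700, 1800)),
  sample_seed = 1234L,
  method = "ubOver",
  k = 0L,
  binary = TRUE,
  target = "default",
  minority = "Yes",
  majority = "No"
)

# Read individual attributes by name.
attr(train, "method")
attr(train, "minority")

# List the names of all attributes carried by the object.
names(attributes(train))
```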
library(dplyr)
# Credit Card Default Data
head(ISLR::Default)
#>   default student   balance    income
#> 1      No      No  729.5265 44361.625
#> 2      No     Yes  817.1804 12106.135
#> 3      No      No 1073.5492 31767.139
#> 4      No      No  529.2506 35704.494
#> 5      No      No  785.6559 38463.496
#> 6      No     Yes  919.5885  7491.559
# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)
# under-sampling with random seed
under <- sb %>%
  sampling_target(seed = 1234L)
under %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes       233
# under-sampling with random seed, and minority class frequency is 40%
under40 <- sb %>%
  sampling_target(seed = 1234L, perc = 40)
under40 %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes       233
# over-sampling with random seed
over <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L)
over %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes      6767
# over-sampling with random seed, and k = 10
over10 <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L, k = 10)
over10 %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes      2330
# SMOTE with random seed
smote <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L)
smote %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No        932
#> 2 Yes       699
# SMOTE with random seed, and perc.under = 250
smote250 <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L, perc.under = 250)
smote250 %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       1165
#> 2 Yes       699