Extract the data to fit the model — sampling

To address class imbalance, perform sampling on the training set of a split_df object.

sampling_target(
  .data,
  method = c("ubUnder", "ubOver", "ubSMOTE"),
  seed = NULL,
  perc = 50,
  k = ifelse(method == "ubSMOTE", 5, 0),
  perc.over = 200,
  perc.under = 200
)

Arguments

.data: an object of class "split_df", usually the result of a call to split_by().
method: character. The sampling method. "ubUnder" is under-sampling, "ubOver" is over-sampling, and "ubSMOTE" is SMOTE (Synthetic Minority Over-sampling TEchnique).
seed: integer. random seed used for sampling
perc: integer. The percentage of the positive class in the final dataset. It is used only in under-sampling. The default is 50. perc cannot exceed 50.
k: integer. It is used only in over-sampling and SMOTE. In over-sampling, if k = 0, sample with replacement from the minority class until both classes have the same number of instances; if k > 0, sample with replacement from the minority class until there are k times the original number of minority instances. In SMOTE, k is the number of neighbours to consider as the pool from which the new examples are generated.
perc.over: integer. It is used only in SMOTE. perc.over/100 is the number of new instances generated for each rare instance. If perc.over < 100, a single instance is generated.
perc.under: integer. It is used only in SMOTE. perc.under/100 is the number of "normal" (majority class) instances that are randomly selected for each smoted observation.
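
The class sizes implied by k, perc.over and perc.under can be checked with a little arithmetic. The sketch below is only an illustration in base R: expected_over() and expected_smote() are hypothetical helpers, not functions of the package, and the SMOTE helper assumes perc.over >= 100. The counts 233 and 6767 are the minority and majority sizes that appear in the Examples output.

```r
# Hypothetical helpers (not part of the package): class sizes implied by
# the argument descriptions above.

# Over-sampling: k = 0 balances the classes; k > 0 yields k times the
# original number of minority rows. The majority class is left unchanged.
expected_over <- function(n_minority, n_majority, k = 0) {
  n_min_new <- if (k == 0) n_majority else k * n_minority
  c(majority = n_majority, minority = n_min_new)
}

# SMOTE (assuming perc.over >= 100): perc.over / 100 synthetic rows are
# generated per minority row, and perc.under / 100 majority rows are kept
# per synthetic row. The original minority rows are retained.
expected_smote <- function(n_minority, perc.over = 200, perc.under = 200) {
  n_synthetic <- n_minority * perc.over / 100
  c(majority = n_synthetic * perc.under / 100,
    minority = n_minority + n_synthetic)
}

expected_over(233, 6767)               # majority 6767, minority 6767
expected_over(233, 6767, k = 10)       # majority 6767, minority 2330
expected_smote(233)                    # majority  932, minority  699
expected_smote(233, perc.under = 250)  # majority 1165, minority  699
```

These values agree with the over-sampling and SMOTE counts shown in the Examples.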

Value

An object of class train_df.

Details

To address the problem of class imbalance, sampling is performed by under-sampling, over-sampling, or SMOTE.
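
As a rough illustration of the two random-sampling strategies, the following base R sketch balances a toy data frame. It is not the package's implementation: the data frame toy and the variable names are made up for this example, and both strategies are shown only in their simplest form (balancing to equal class sizes).

```r
# Toy data: 90 majority ("No") rows and 10 minority ("Yes") rows.
set.seed(1234)
toy <- data.frame(
  x = rnorm(100),
  y = factor(rep(c("No", "Yes"), times = c(90, 10)))
)

minority_idx <- which(toy$y == "Yes")
majority_idx <- which(toy$y == "No")

# Random under-sampling: keep every minority row and draw an equally
# sized random subset of the majority rows (without replacement).
under_toy <- toy[c(sample(majority_idx, length(minority_idx)),
                   minority_idx), ]

# Random over-sampling: keep every majority row and resample the minority
# rows with replacement until both classes have the same size.
over_toy <- toy[c(majority_idx,
                  sample(minority_idx, length(majority_idx), replace = TRUE)), ]

table(under_toy$y)  # 10 rows per class
table(over_toy$y)   # 90 rows per class
```

Under-sampling discards majority rows, while over-sampling duplicates minority rows. SMOTE instead creates synthetic minority rows by interpolating between a minority row and its nearest minority neighbours.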

Attributes of train_df class

The attributes of the train_df class are as follows:

sample_seed : integer. random seed used for sampling
method : character. sampling methods.
perc : integer. perc argument value
k : integer. k argument value
perc.over : integer. perc.over argument value
perc.under : integer. perc.under argument value
binary : logical. whether the target variable is a binary class
target : character. target variable name
minority : character. the level of the minority class
majority : character. the level of the majority class

Examples

library(dplyr)

# Credit Card Default Data
head(ISLR::Default)
#>   default student   balance    income
#> 1      No      No  729.5265 44361.625
#> 2      No     Yes  817.1804 12106.135
#> 3      No      No 1073.5492 31767.139
#> 4      No      No  529.2506 35704.494
#> 5      No      No  785.6559 38463.496
#> 6      No     Yes  919.5885  7491.559

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)

# under-sampling with random seed
under <- sb %>%
  sampling_target(seed = 1234L)

under %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes       233

# under-sampling with random seed, and minority class frequency is 40%
under40 <- sb %>%
  sampling_target(seed = 1234L, perc = 40)

under40 %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes       233

# over-sampling with random seed
over <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L)

over %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes      6767

# over-sampling with random seed, and k = 10
over10 <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L, k = 10)

over10 %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes      2330

# SMOTE with random seed
smote <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L)

smote %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No        932
#> 2 Yes       699

# SMOTE with random seed, and perc.under = 250
smote250 <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L, perc.under = 250)

smote250 %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       1165
#> 2 Yes       699