To address class imbalance, perform sampling on the training set of a split_df object.

sampling_target(
  .data,
  method = c("ubUnder", "ubOver", "ubSMOTE"),
  seed = NULL,
  perc = 50,
  k = ifelse(method == "ubSMOTE", 5, 0),
  perc.over = 200,
  perc.under = 200
)

Arguments

.data

an object of class "split_df", usually the result of a call to split_by().

method

character. The sampling method. "ubUnder" is under-sampling, "ubOver" is over-sampling, and "ubSMOTE" is SMOTE (Synthetic Minority Over-sampling TEchnique).

seed

integer. random seed used for sampling

perc

integer. The percentage of the positive class in the final dataset. Used only in under-sampling. The default is 50, and perc cannot exceed 50.

k

integer. Used only in over-sampling and SMOTE. In over-sampling, if k = 0, the minority class is sampled with replacement until both classes have the same number of instances; if k > 0, the minority class is sampled with replacement until it has k times its original number of instances. In SMOTE, k is the number of nearest neighbours considered as the pool from which new examples are generated.

perc.over

integer. Used only in SMOTE. perc.over/100 is the number of new instances generated for each rare instance. If perc.over < 100, a single instance is generated.

perc.under

integer. It is used only in SMOTE. perc.under/100 is the number of "normal" (majority class) instances that are randomly selected for each smoted observation.
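
The way perc.over and perc.under determine the class sizes can be checked with simple arithmetic. The sketch below is plain base R and not part of the package: smote_sizes is a hypothetical helper, and it assumes that perc.over is a multiple of 100 and that the original minority rows are kept alongside the synthetic ones. Under those assumptions it reproduces the SMOTE counts shown in the Examples, where the training set holds 233 minority rows.

```r
# Hypothetical helper (not part of the package): expected class sizes
# after SMOTE, derived from the perc.over and perc.under descriptions.
smote_sizes <- function(n_minority, perc.over = 200, perc.under = 200) {
  # perc.over / 100 synthetic rows are generated per minority row
  n_synthetic <- n_minority * perc.over / 100
  # perc.under / 100 majority rows are drawn per synthetic row
  n_majority <- n_synthetic * perc.under / 100
  # assumption: original minority rows are kept next to the synthetic ones
  c(majority = n_majority, minority = n_minority + n_synthetic)
}

# 233 minority rows, default arguments
smote_sizes(233)
#> majority minority
#>      932      699

# Same training set with perc.under = 250
smote_sizes(233, perc.under = 250)
#> majority minority
#>     1165      699
```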

Value

An object of class "train_df".

Details

To address class imbalance, sampling is performed with one of three methods: under-sampling, over-sampling, or SMOTE.
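
As a rough illustration of random over-sampling and the role of k, the following base R sketch resamples the minority class of a toy data frame. The function oversample and the toy data are hypothetical and are not the package's implementation; the sketch only follows the description of k given under Arguments, and it assumes the resampled rows replace the original minority rows rather than being added to them.

```r
# Hypothetical helper (not part of the package): random over-sampling of
# the minority class, following the description of the k argument.
oversample <- function(df, target, minority, k = 0, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  is_min  <- df[[target]] == minority
  idx_min <- which(is_min)
  idx_maj <- which(!is_min)
  # k = 0: grow the minority class to the size of the majority class
  # k > 0: grow it to k times its original size
  n_new <- if (k == 0) length(idx_maj) else k * length(idx_min)
  # sample positions within idx_min, with replacement
  picked <- idx_min[sample.int(length(idx_min), n_new, replace = TRUE)]
  df[c(idx_maj, picked), , drop = FALSE]
}

# Toy data: 95 "No" rows and 5 "Yes" rows
toy <- data.frame(
  y = factor(rep(c("No", "Yes"), times = c(95, 5))),
  x = seq_len(100)
)

table(oversample(toy, "y", "Yes", seed = 1234L)$y)          # No: 95, Yes: 95
table(oversample(toy, "y", "Yes", k = 10, seed = 1234L)$y)  # No: 95, Yes: 50
```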

attributes of train_df class

The attributes of the train_df class are as follows:

  • sample_seed : integer. random seed used for sampling

  • method : character. sampling methods.

  • perc : integer. perc argument value

  • k : integer. k argument value

  • perc.over : integer. perc.over argument value

  • perc.under : integer. perc.under argument value

  • binary : logical. whether the target variable is a binary class

  • target : character. target variable name

  • minority : character. the level of the minority class

  • majority : character. the level of the majority class
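
Attributes like these can be read with base R's attr() and attributes(). The sketch below uses a stand-in object built with structure(), purely for illustration: it carries only a few of the listed attributes, and its class vector and values are assumptions rather than what sampling_target() actually returns.

```r
# Stand-in object for illustration only; a real train_df comes from
# sampling_target(). Only a few of the listed attributes are set here.
mock <- structure(
  data.frame(default = factor(c("No", "Yes")), x = c(1, 2)),
  class = c("train_df", "data.frame"),
  sample_seed = 1234L,
  method = "ubUnder",
  target = "default",
  minority = "Yes",
  majority = "No"
)

attr(mock, "method")    # sampling method that was applied
attr(mock, "minority")  # level of the minority class

# Pull several attributes at once as a named list
attributes(mock)[c("sample_seed", "target", "majority")]
```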

Examples

library(dplyr)

# Credit Card Default Data
head(ISLR::Default)
#>   default student   balance    income
#> 1      No      No  729.5265 44361.625
#> 2      No     Yes  817.1804 12106.135
#> 3      No      No 1073.5492 31767.139
#> 4      No      No  529.2506 35704.494
#> 5      No      No  785.6559 38463.496
#> 6      No     Yes  919.5885  7491.559

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default)

# under-sampling with random seed
under <- sb %>%
  sampling_target(seed = 1234L)

under %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes       233

# under-sampling with random seed, and minority class frequency is 40%
under40 <- sb %>%
  sampling_target(seed = 1234L, perc = 40)

under40 %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes       233

# over-sampling with random seed
over <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L)

over %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes      6767

# over-sampling with random seed, and k = 10
over10 <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L, k = 10)

over10 %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       6767
#> 2 Yes      2330

# SMOTE with random seed
smote <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L)

smote %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No        932
#> 2 Yes       699

# SMOTE with random seed, and perc.under = 250
smote250 <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L, perc.under = 250)

smote250 %>%
  count(default)
#> # A tibble: 2 × 2
#>   default     n
#>   <fct>   <int>
#> 1 No       1165
#> 2 Yes       699