Cleansing the dataset for classification modeling

cleanse() cleanses the dataset for classification modeling.

# S3 method for data.frame
cleanse(
  .data,
  uniq = TRUE,
  uniq_thres = 0.1,
  char = TRUE,
  missing = FALSE,
  verbose = TRUE,
  ...
)

cleanse(.data, ...)

Arguments

.data: a data.frame or a tbl_df.
uniq: logical. Whether to remove variables that have only one unique value.
uniq_thres: numeric. Threshold for removing variables. A variable is removed when its ratio of unique values (number of unique values / number of observations) is greater than this value.
char: logical. Whether to convert character variables to factor.
missing: logical. Whether to remove variables that include missing values.
verbose: logical. Set whether to echo information to the console at runtime.
...: further arguments passed to or from other methods.
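
To illustrate the ratio that uniq_thres is compared against, the following base R sketch computes the number of unique values divided by the number of observations for each column. The helper name unique_ratio and the toy data frame are hypothetical and used only for illustration. This is not the package's internal code.

```r
# Hypothetical helper (not part of the package): ratio of unique values
# to the number of observations, as described for `uniq_thres`.
unique_ratio <- function(x) length(unique(x)) / length(x)

# Toy data, assumed for illustration only
toy <- data.frame(
  key   = paste0("k", 1:50),            # 50 distinct values in 50 rows
  group = rep(c("a", "b"), each = 25),  # 2 distinct values in 50 rows
  stringsAsFactors = FALSE
)

# One ratio per column: 50 / 50 for key, 2 / 50 for group
ratios <- vapply(toy, unique_ratio, numeric(1))

# Columns whose ratio exceeds the threshold are candidates for removal
candidates <- names(ratios)[ratios > 0.1]
candidates
```

With a threshold of 0.1, only key exceeds the ratio in this toy data, which matches the idea that identifier-like columns are the ones to drop.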

Value

An object of data.frame or train_df. The return value is an object of the same type as the .data argument.

Details

This function is useful when fitting a classification model. It does the following: it removes variables that have only one unique value; it removes character or categorical variables whose number of unique values is high relative to the number of observations, since such a variable typically corresponds to an identifier; and it converts character variables to factor.
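
The steps described above can be approximated in base R as follows. This is a minimal sketch under stated assumptions, not the package's implementation: the function name cleanse_sketch and the toy data are hypothetical, and the unique-ratio check is applied only to character and factor columns, following the description in this section.

```r
# Hypothetical base R sketch of the steps described above.
# `cleanse_sketch` is an assumed name, not a function of the package.
cleanse_sketch <- function(df, uniq_thres = 0.1) {
  n_obs  <- nrow(df)
  n_uniq <- vapply(df, function(x) length(unique(x)), integer(1))

  # Step 1: flag variables that have only one unique value
  single <- n_uniq == 1L

  # Step 2: flag character/factor variables whose unique ratio
  # exceeds the threshold (identifier-like variables)
  is_cat <- vapply(df, function(x) is.character(x) || is.factor(x), logical(1))
  high   <- is_cat & (n_uniq / n_obs > uniq_thres)

  # Drop the flagged variables
  df <- df[!(single | high)]

  # Step 3: convert the remaining character variables to factor
  for (nm in names(df)[vapply(df, is.character, logical(1))]) {
    df[[nm]] <- factor(df[[nm]])
  }
  df
}

# Toy data, assumed for illustration only
toy <- data.frame(
  key   = paste0("k", 1:50),            # identifier-like, ratio 1
  year  = "2018",                       # a single unique value
  group = rep(c("a", "b"), each = 25),  # low unique ratio
  value = 1:50,                         # integer, not categorical
  stringsAsFactors = FALSE
)

out <- cleanse_sketch(toy)
str(out)
```

In this sketch the integer column value is kept even though all of its values are distinct, because the ratio check is restricted to character and factor columns.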

Examples

# create sample dataset
set.seed(123L)
id <- sapply(1:1000, function(x)
  paste(c(sample(letters, 5), x), collapse = ""))

year <- "2018"

set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)

set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)

set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)

dat <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)
# structure of dataset
str(dat)
#> 'data.frame':	1000 obs. of  5 variables:
#>  $ id   : chr  "osncj1" "rvket2" "nvesi3" "chgji4" ...
#>  $ year : chr  "2018" "2018" "2018" "2018" ...
#>  $ count: int  3 3 10 2 6 5 4 6 9 10 ...
#>  $ alpha: chr  "o" "s" "n" "c" ...
#>  $ flag : chr  "N" "N" "N" "N" ...

# cleanse the dataset with the default arguments
newDat <- cleanse(dat)
#> ── Checking unique value ─────────────────────────── unique value is one ──
#> remove variables that unique value is one
#> • year
#> 
#> ── Checking unique rate ─────────────────────────────── high unique rate ──
#> remove variables with high unique rate
#> • id = 1000(1)
#> 
#> ── Checking character variables ─────────────────────── categorical data ──
#> converts character variables to factor
#> • alpha
#> • flag
#> 

# structure of the cleansed dataset
str(newDat)
#> 'data.frame':	1000 obs. of  3 variables:
#>  $ count: int  3 3 10 2 6 5 4 6 9 10 ...
#>  $ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
#>  $ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...

# cleanse the dataset without the unique value checks
newDat <- cleanse(dat, uniq = FALSE)
#> ── Checking character variables ─────────────────────── categorical data ──
#> converts character variables to factor
#> • id
#> • year
#> • alpha
#> • flag
#> 

# structure of the cleansed dataset
str(newDat)
#> 'data.frame':	1000 obs. of  5 variables:
#>  $ id   : Factor w/ 1000 levels "ablnc282","abqym54",..: 594 715 558 94 727 270 499 882 930 515 ...
#>  $ year : Factor w/ 1 level "2018": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ count: int  3 3 10 2 6 5 4 6 9 10 ...
#>  $ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
#>  $ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...

# cleanse the dataset with a unique rate threshold of 0.3
newDat <- cleanse(dat, uniq_thres = 0.3)
#> ── Checking unique value ─────────────────────────── unique value is one ──
#> remove variables that unique value is one
#> • year
#> 
#> ── Checking unique rate ─────────────────────────────── high unique rate ──
#> remove variables with high unique rate
#> • id = 1000(1)
#> 
#> ── Checking character variables ─────────────────────── categorical data ──
#> converts character variables to factor
#> • alpha
#> • flag
#> 

# structure of the cleansed dataset
str(newDat)
#> 'data.frame':	1000 obs. of  3 variables:
#>  $ count: int  3 3 10 2 6 5 4 6 9 10 ...
#>  $ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
#>  $ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...

# cleanse the dataset without converting character variables to factor
newDat <- cleanse(dat, char = FALSE)
#> ── Checking unique value ─────────────────────────── unique value is one ──
#> remove variables that unique value is one
#> • year
#> 
#> ── Checking unique rate ─────────────────────────────── high unique rate ──
#> remove variables with high unique rate
#> • id = 1000(1)
#> 

# structure of the cleansed dataset
str(newDat)
#> 'data.frame':	1000 obs. of  3 variables:
#>  $ count: int  3 3 10 2 6 5 4 6 9 10 ...
#>  $ alpha: chr  "o" "s" "n" "c" ...
#>  $ flag : chr  "N" "N" "N" "N" ...