cleanse() cleanses a dataset for classification modeling.

# S3 method for data.frame
cleanse(
  .data,
  uniq = TRUE,
  uniq_thres = 0.1,
  char = TRUE,
  missing = FALSE,
  verbose = TRUE,
  ...
)

cleanse(.data, ...)

Arguments

.data

a data.frame or a tbl_df.

uniq

logical. Whether to remove variables that have only one unique value.

uniq_thres

numeric. Threshold for removing variables: a variable is removed when its ratio of unique values (number of unique values / number of observations) is greater than this value.

char

logical. Whether to convert character variables to factors.

missing

logical. Whether to remove variables that contain missing values.

verbose

logical. Whether to print progress information to the console at runtime.

...

further arguments passed to or from other methods.

Value

A data.frame or train_df object. The return value has the same type as the .data argument.

Details

This function is useful when fitting a classification model. It does the following: It removes variables that have only one unique value. It removes character or categorical variables whose number of unique values is high relative to the number of observations; such a variable typically corresponds to an identifier. It converts character variables to factors.
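
The steps described above can be approximated in base R. The following is only a sketch under stated assumptions, not the package's implementation: cleanse_sketch() is a hypothetical helper, the order in which the missing-value step runs is a guess, and the unique-ratio check is applied only to character and factor variables, as the description suggests.

```r
# Hypothetical helper, NOT part of the package: a base R approximation
# of the cleansing steps described in Details.
cleanse_sketch <- function(df, uniq = TRUE, uniq_thres = 0.1,
                           char = TRUE, missing = FALSE) {
  if (uniq) {
    # Step 1: drop variables that have a single unique value.
    n_unique <- vapply(df, function(x) length(unique(x)), integer(1))
    df <- df[, n_unique > 1L, drop = FALSE]

    # Step 2: drop character/factor variables whose unique ratio
    # (unique values / observations) is greater than the threshold.
    is_cat <- vapply(df, function(x) is.character(x) || is.factor(x), logical(1))
    ratio <- vapply(df, function(x) length(unique(x)) / length(x), numeric(1))
    df <- df[, !(is_cat & ratio > uniq_thres), drop = FALSE]
  }

  if (missing) {
    # Optional step (position in the sequence is an assumption):
    # drop variables that contain at least one NA.
    has_na <- vapply(df, anyNA, logical(1))
    df <- df[, !has_na, drop = FALSE]
  }

  if (char) {
    # Step 3: convert the remaining character variables to factors.
    is_chr <- vapply(df, is.character, logical(1))
    if (any(is_chr)) {
      df[is_chr] <- lapply(df[is_chr], factor)
    }
  }

  df
}

# Toy data: an identifier, a constant, a two-level category, a count.
toy <- data.frame(
  id = paste0("k", 1:40),
  year = "2018",
  grp = rep(c("a", "b"), 20),
  n = 1:40,
  stringsAsFactors = FALSE
)

out <- cleanse_sketch(toy)
str(out)
```

In this toy data, year is dropped because it is constant, id is dropped because every value is unique, and grp is converted to a factor. The integer column n is kept even though all of its values are unique, because the sketch restricts the ratio check to character and factor variables.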

Examples

# create sample dataset
set.seed(123L)
id <- sapply(1:1000, function(x)
  paste(c(sample(letters, 5), x), collapse = ""))

year <- "2018"

set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)

set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)

set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)

dat <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)
# structure of dataset
str(dat)
#> 'data.frame':	1000 obs. of  5 variables:
#>  $ id   : chr  "osncj1" "rvket2" "nvesi3" "chgji4" ...
#>  $ year : chr  "2018" "2018" "2018" "2018" ...
#>  $ count: int  3 3 10 2 6 5 4 6 9 10 ...
#>  $ alpha: chr  "o" "s" "n" "c" ...
#>  $ flag : chr  "N" "N" "N" "N" ...

# cleansing dataset
newDat <- cleanse(dat)
#> ── Checking unique value ─────────────────────────── unique value is one ──
#> remove variables that unique value is one
#>  year
#> 
#> ── Checking unique rate ─────────────────────────────── high unique rate ──
#> remove variables with high unique rate
#>  id = 1000(1)
#> 
#> ── Checking character variables ─────────────────────── categorical data ──
#> converts character variables to factor
#>  alpha
#>  flag
#> 

# structure of the cleansed dataset
str(newDat)
#> 'data.frame':	1000 obs. of  3 variables:
#>  $ count: int  3 3 10 2 6 5 4 6 9 10 ...
#>  $ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
#>  $ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...

# cleansing dataset without the unique-value checks
newDat <- cleanse(dat, uniq = FALSE)
#> ── Checking character variables ─────────────────────── categorical data ──
#> converts character variables to factor
#>  id
#>  year
#>  alpha
#>  flag
#> 

# structure of the cleansed dataset
str(newDat)
#> 'data.frame':	1000 obs. of  5 variables:
#>  $ id   : Factor w/ 1000 levels "ablnc282","abqym54",..: 594 715 558 94 727 270 499 882 930 515 ...
#>  $ year : Factor w/ 1 level "2018": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ count: int  3 3 10 2 6 5 4 6 9 10 ...
#>  $ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
#>  $ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...

# cleansing dataset with a unique-ratio threshold of 0.3
newDat <- cleanse(dat, uniq_thres = 0.3)
#> ── Checking unique value ─────────────────────────── unique value is one ──
#> remove variables that unique value is one
#>  year
#> 
#> ── Checking unique rate ─────────────────────────────── high unique rate ──
#> remove variables with high unique rate
#>  id = 1000(1)
#> 
#> ── Checking character variables ─────────────────────── categorical data ──
#> converts character variables to factor
#>  alpha
#>  flag
#> 

# structure of the cleansed dataset
str(newDat)
#> 'data.frame':	1000 obs. of  3 variables:
#>  $ count: int  3 3 10 2 6 5 4 6 9 10 ...
#>  $ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
#>  $ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...

# cleansing dataset without converting character variables to factors
newDat <- cleanse(dat, char = FALSE)
#> ── Checking unique value ─────────────────────────── unique value is one ──
#> remove variables that unique value is one
#>  year
#> 
#> ── Checking unique rate ─────────────────────────────── high unique rate ──
#> remove variables with high unique rate
#>  id = 1000(1)
#> 

# structure of the cleansed dataset
str(newDat)
#> 'data.frame':	1000 obs. of  3 variables:
#>  $ count: int  3 3 10 2 6 5 4 6 9 10 ...
#>  $ alpha: chr  "o" "s" "n" "c" ...
#>  $ flag : chr  "N" "N" "N" "N" ...
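
The runs with uniq_thres = 0.1 and uniq_thres = 0.3 remove the same variables, which the unique ratios explain. The sketch below recomputes those ratios in base R for the character columns of the example data; unique_ratio() is a hypothetical helper, not a package function. Every id is unique, so its ratio is 1, which appears to be what the "id = 1000(1)" line reports (count followed by ratio). According to the str() output above, alpha has 26 levels and flag has 2, giving ratios of 0.026 and 0.002, well below either threshold.

```r
# Rebuild the character columns from the example above.
set.seed(123L)
id <- sapply(1:1000, function(x)
  paste(c(sample(letters, 5), x), collapse = ""))

set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)

set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)

# Hypothetical helper: number of unique values / number of observations.
unique_ratio <- function(x) length(unique(x)) / length(x)

ratios <- c(
  id = unique_ratio(id),
  alpha = unique_ratio(alpha),
  flag = unique_ratio(flag)
)
ratios

# Variables flagged at each threshold (TRUE means "would be removed").
ratios > 0.1
ratios > 0.3
```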