Cleansing the dataset

Cleansing the dataset for data analytics

Author

Affiliation

Choonghyun Ryu

 

Published

Nov. 30, 2021

DOI

Preface

If you created a dataset to create a classification model, you must perform cleansing of the data. After you create the dataset, you should do the following:

The alookr package makes these steps fast and easy:

Data: create example dataset

To illustrate basic use of the alookr package, create the data_exam with sample function. The data_exam dataset include 5 variables.

variables are as follows.:

# create sample dataset
set.seed(123L)
id <- sapply(1:1000, function(x)
  paste(c(sample(letters, 5), x), collapse = ""))

year <- "2018"

set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)

set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)

set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)

data_exam <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)

# structure of dataset
str(data_exam)
'data.frame':   1000 obs. of  5 variables:
 $ id   : chr  "osncj1" "rvket2" "nvesi3" "chgji4" ...
 $ year : chr  "2018" "2018" "2018" "2018" ...
 $ count: int  3 3 10 2 6 5 4 6 9 10 ...
 $ alpha: chr  "o" "s" "n" "c" ...
 $ flag : chr  "N" "N" "N" "N" ...

# summary of dataset
summary(data_exam)
      id                year               count       
 Length:1000        Length:1000        Min.   : 1.000  
 Class :character   Class :character   1st Qu.: 3.000  
 Mode  :character   Mode  :character   Median : 6.000  
                                       Mean   : 5.698  
                                       3rd Qu.: 8.000  
                                       Max.   :10.000  
    alpha               flag          
 Length:1000        Length:1000       
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      

Clean dataset

cleanse() cleans up the dataset before fitting the classification model.

The function of cleanse() is as follows.:

Cleanse dataset with cleanse()

For example, we can cleanse all variables in data_exam:

# cleansing dataset
newDat <- cleanse(data_exam)
── Checking unique value ─────────────────────────── unique value is one ──
• year

── Checking unique rate ─────────────────────────────── high unique rate ──
• id = 1000(1)

── Checking character variables ─────────────────────── categorical data ──
• alpha
• flag

# structure of cleansing dataset
str(newDat)
'data.frame':   1000 obs. of  3 variables:
 $ count: int  3 3 10 2 6 5 4 6 9 10 ...
 $ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
 $ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...

For example, we can not remove the categorical data that is removed by changing the threshold of the unique rate:

# cleansing dataset
newDat <- cleanse(data_exam, uniq_thres = 0.03)
── Checking unique value ─────────────────────────── unique value is one ──
• year

── Checking unique rate ─────────────────────────────── high unique rate ──
• id = 1000(1)

── Checking character variables ─────────────────────── categorical data ──
• alpha
• flag

# structure of cleansing dataset
str(newDat)
'data.frame':   1000 obs. of  3 variables:
 $ count: int  3 3 10 2 6 5 4 6 9 10 ...
 $ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
 $ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...

The alpha variable was not removed.

If you do not want to apply a unique rate, you can set the value of the uniq argument to FALSE.:

# cleansing dataset
newDat <- cleanse(data_exam, uniq = FALSE)
── Checking character variables ─────────────────────── categorical data ──
• id
• year
• alpha
• flag

# structure of cleansing dataset
str(newDat)
'data.frame':   1000 obs. of  5 variables:
 $ id   : Factor w/ 1000 levels "ablnc282","abqym54",..: 594 715 558 94 727 270 499 882 930 515 ...
 $ year : Factor w/ 1 level "2018": 1 1 1 1 1 1 1 1 1 1 ...
 $ count: int  3 3 10 2 6 5 4 6 9 10 ...
 $ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
 $ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...

If you do not want to force type conversion of a character variable to factor, you can set the value of the char argument to FALSE.:

# cleansing dataset
newDat <- cleanse(data_exam, char = FALSE)
── Checking unique value ─────────────────────────── unique value is one ──
• year

── Checking unique rate ─────────────────────────────── high unique rate ──
• id = 1000(1)

# structure of cleansing dataset
str(newDat)
'data.frame':   1000 obs. of  3 variables:
 $ count: int  3 3 10 2 6 5 4 6 9 10 ...
 $ alpha: chr  "o" "s" "n" "c" ...
 $ flag : chr  "N" "N" "N" "N" ...

If you want to remove a variable that contains missing values, specify the value of the missing argument as TRUE. The following example removes the flag variable that contains the missing value.

data_exam$flag[1] <- NA 

# cleansing dataset
newDat <- cleanse(data_exam, missing = TRUE)
── Checking missing value ────────────────────────────────── included NA ──
• flag

── Checking unique value ─────────────────────────── unique value is one ──
• year

── Checking unique rate ─────────────────────────────── high unique rate ──
• id = 1000(1)

── Checking character variables ─────────────────────── categorical data ──
• alpha

# structure of cleansing dataset
str(newDat)
'data.frame':   1000 obs. of  2 variables:
 $ count: int  3 3 10 2 6 5 4 6 9 10 ...
 $ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...

Diagnosis and removal of highly correlated variables

In the linear model, there is a multicollinearity if there is a strong correlation between independent variables. So it is better to remove one variable from a pair of variables where the correlation exists.

Even if it is not a linear model, removing one variable from a strongly correlated pair of variables can also reduce the overhead of the operation. It is also easy to interpret the model.

Cleanse dataset with treatment_corr()

treatment_corr() diagnose pairs of highly correlated variables or remove on of them.

treatment_corr() calculates correlation coefficient of pearson for numerical variable, and correlation coefficient of spearman for categorical variable.

For example, we can diagnosis and removal of highly correlated variables:

# numerical variable
x1 <- 1:100
set.seed(12L)
x2 <- sample(1:3, size = 100, replace = TRUE) * x1 + rnorm(1)
set.seed(1234L)
x3 <- sample(1:2, size = 100, replace = TRUE) * x1 + rnorm(1)

# categorical variable
x4 <- factor(rep(letters[1:20], time = 5))
set.seed(100L)
x5 <- factor(rep(letters[1:20 + sample(1:6, size = 20, replace = TRUE)], time = 5))
set.seed(200L)
x6 <- factor(rep(letters[1:20 + sample(1:3, size = 20, replace = TRUE)], time = 5))
set.seed(300L)
x7 <- factor(sample(letters[1:5], size = 100, replace = TRUE))

exam <- data.frame(x1, x2, x3, x4, x5, x6, x7)
str(exam)
'data.frame':   100 obs. of  7 variables:
 $ x1: int  1 2 3 4 5 6 7 8 9 10 ...
 $ x2: num  2.55 4.55 9.55 12.55 10.55 ...
 $ x3: num  0.194 2.194 4.194 6.194 3.194 ...
 $ x4: Factor w/ 20 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ x5: Factor w/ 13 levels "c","e","f","g",..: 1 5 3 2 4 7 6 8 9 8 ...
 $ x6: Factor w/ 15 levels "c","d","f","g",..: 1 2 3 4 3 5 6 7 8 9 ...
 $ x7: Factor w/ 5 levels "a","b","c","d",..: 2 2 1 4 5 1 4 3 1 5 ...
head(exam)
  x1        x2         x3 x4 x5 x6 x7
1  1  2.554297  0.1939687  a  c  c  b
2  2  4.554297  2.1939687  b  h  d  b
3  3  9.554297  4.1939687  c  f  f  a
4  4 12.554297  6.1939687  d  e  g  d
5  5 10.554297  3.1939687  e  g  f  e
6  6  6.554297 10.1939687  f  l  h  a

# default case
exam_01 <- treatment_corr(exam)
head(exam_01)
         x2         x3 x6 x7
1  2.554297  0.1939687  c  b
2  4.554297  2.1939687  d  b
3  9.554297  4.1939687  f  a
4 12.554297  6.1939687  g  d
5 10.554297  3.1939687  f  e
6  6.554297 10.1939687  h  a

# not removing variables
treatment_corr(exam, treat = FALSE)

# Set a threshold to detecting variables when correlation greater then 0.9
treatment_corr(exam, corr_thres = 0.9, treat = FALSE)

# not verbose mode
exam_02 <- treatment_corr(exam, verbose = FALSE)
head(exam_02)
         x2         x3 x6 x7
1  2.554297  0.1939687  c  b
2  4.554297  2.1939687  d  b
3  9.554297  4.1939687  f  a
4 12.554297  6.1939687  g  d
5 10.554297  3.1939687  f  e
6  6.554297 10.1939687  h  a

Footnotes