Cleansing the dataset for data analytics
If you created a dataset to create a classification model, you must perform cleansing of the data. After you create the dataset, you should do the following:
The alookr package makes these steps fast and easy:
To illustrate basic use of the alookr package, create the data_exam
with sample function. The data_exam
dataset include 5 variables.
variables are as follows.:
id
: characteryear
: charactercount
: numericalpha
: characterflag
: character# create sample dataset
set.seed(123L)
id <- sapply(1:1000, function(x)
paste(c(sample(letters, 5), x), collapse = ""))
year <- "2018"
set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)
set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)
set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)
data_exam <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)
# structure of dataset
str(data_exam)
'data.frame': 1000 obs. of 5 variables:
$ id : chr "osncj1" "rvket2" "nvesi3" "chgji4" ...
$ year : chr "2018" "2018" "2018" "2018" ...
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: chr "o" "s" "n" "c" ...
$ flag : chr "N" "N" "N" "N" ...
# summary of dataset
summary(data_exam)
id year count
Length:1000 Length:1000 Min. : 1.000
Class :character Class :character 1st Qu.: 3.000
Mode :character Mode :character Median : 6.000
Mean : 5.698
3rd Qu.: 8.000
Max. :10.000
alpha flag
Length:1000 Length:1000
Class :character Class :character
Mode :character Mode :character
cleanse()
cleans up the dataset before fitting the classification model.
The function of cleanse() is as follows.:
cleanse()
For example, we can cleanse all variables in data_exam
:
# cleansing dataset
newDat <- cleanse(data_exam)
── Checking unique value ─────────────────────────── unique value is one ──
• year
── Checking unique rate ─────────────────────────────── high unique rate ──
• id = 1000(1)
── Checking character variables ─────────────────────── categorical data ──
• alpha
• flag
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 3 variables:
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
$ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
remove variables whose unique value is one
: The year variable has only one value, “2018”. Not needed when fitting the model. So it was removed.remove variables with high unique rate
: If the number of levels of categorical data is very large, it is not suitable for classification model. In this case, it is highly likely to be an identifier of the data. So, remove the categorical (or character) variable with a high value of the unique rate defined as “number of levels / number of observations”.
converts character variables to factor
: The character type flag variable is converted to a factor type.For example, we can not remove the categorical data that is removed by changing the threshold of the unique rate
:
# cleansing dataset
newDat <- cleanse(data_exam, uniq_thres = 0.03)
── Checking unique value ─────────────────────────── unique value is one ──
• year
── Checking unique rate ─────────────────────────────── high unique rate ──
• id = 1000(1)
── Checking character variables ─────────────────────── categorical data ──
• alpha
• flag
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 3 variables:
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
$ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
The alpha
variable was not removed.
If you do not want to apply a unique rate, you can set the value of the uniq
argument to FALSE.:
# cleansing dataset
newDat <- cleanse(data_exam, uniq = FALSE)
── Checking character variables ─────────────────────── categorical data ──
• id
• year
• alpha
• flag
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 5 variables:
$ id : Factor w/ 1000 levels "ablnc282","abqym54",..: 594 715 558 94 727 270 499 882 930 515 ...
$ year : Factor w/ 1 level "2018": 1 1 1 1 1 1 1 1 1 1 ...
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
$ flag : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 1 1 1 ...
If you do not want to force type conversion of a character variable to factor, you can set the value of the char
argument to FALSE.:
# cleansing dataset
newDat <- cleanse(data_exam, char = FALSE)
── Checking unique value ─────────────────────────── unique value is one ──
• year
── Checking unique rate ─────────────────────────────── high unique rate ──
• id = 1000(1)
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 3 variables:
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: chr "o" "s" "n" "c" ...
$ flag : chr "N" "N" "N" "N" ...
If you want to remove a variable that contains missing values, specify the value of the missing
argument as TRUE. The following example removes the flag variable that contains the missing value.
data_exam$flag[1] <- NA
# cleansing dataset
newDat <- cleanse(data_exam, missing = TRUE)
── Checking missing value ────────────────────────────────── included NA ──
• flag
── Checking unique value ─────────────────────────── unique value is one ──
• year
── Checking unique rate ─────────────────────────────── high unique rate ──
• id = 1000(1)
── Checking character variables ─────────────────────── categorical data ──
• alpha
# structure of cleansing dataset
str(newDat)
'data.frame': 1000 obs. of 2 variables:
$ count: int 3 3 10 2 6 5 4 6 9 10 ...
$ alpha: Factor w/ 26 levels "a","b","c","d",..: 15 19 14 3 10 18 22 11 5 20 ...
In the linear model, there is a multicollinearity if there is a strong correlation between independent variables. So it is better to remove one variable from a pair of variables where the correlation exists.
Even if it is not a linear model, removing one variable from a strongly correlated pair of variables can also reduce the overhead of the operation. It is also easy to interpret the model.
treatment_corr()
treatment_corr()
diagnose pairs of highly correlated variables or remove on of them.
treatment_corr()
calculates correlation coefficient of pearson for numerical variable, and correlation coefficient of spearman for categorical variable.
For example, we can diagnosis and removal of highly correlated variables:
# numerical variable
x1 <- 1:100
set.seed(12L)
x2 <- sample(1:3, size = 100, replace = TRUE) * x1 + rnorm(1)
set.seed(1234L)
x3 <- sample(1:2, size = 100, replace = TRUE) * x1 + rnorm(1)
# categorical variable
x4 <- factor(rep(letters[1:20], time = 5))
set.seed(100L)
x5 <- factor(rep(letters[1:20 + sample(1:6, size = 20, replace = TRUE)], time = 5))
set.seed(200L)
x6 <- factor(rep(letters[1:20 + sample(1:3, size = 20, replace = TRUE)], time = 5))
set.seed(300L)
x7 <- factor(sample(letters[1:5], size = 100, replace = TRUE))
exam <- data.frame(x1, x2, x3, x4, x5, x6, x7)
str(exam)
'data.frame': 100 obs. of 7 variables:
$ x1: int 1 2 3 4 5 6 7 8 9 10 ...
$ x2: num 2.55 4.55 9.55 12.55 10.55 ...
$ x3: num 0.194 2.194 4.194 6.194 3.194 ...
$ x4: Factor w/ 20 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x5: Factor w/ 13 levels "c","e","f","g",..: 1 5 3 2 4 7 6 8 9 8 ...
$ x6: Factor w/ 15 levels "c","d","f","g",..: 1 2 3 4 3 5 6 7 8 9 ...
$ x7: Factor w/ 5 levels "a","b","c","d",..: 2 2 1 4 5 1 4 3 1 5 ...
head(exam)
x1 x2 x3 x4 x5 x6 x7
1 1 2.554297 0.1939687 a c c b
2 2 4.554297 2.1939687 b h d b
3 3 9.554297 4.1939687 c f f a
4 4 12.554297 6.1939687 d e g d
5 5 10.554297 3.1939687 e g f e
6 6 6.554297 10.1939687 f l h a
# default case
exam_01 <- treatment_corr(exam)
head(exam_01)
x2 x3 x6 x7
1 2.554297 0.1939687 c b
2 4.554297 2.1939687 d b
3 9.554297 4.1939687 f a
4 12.554297 6.1939687 g d
5 10.554297 3.1939687 f e
6 6.554297 10.1939687 h a
# not removing variables
treatment_corr(exam, treat = FALSE)
# Set a threshold to detecting variables when correlation greater then 0.9
treatment_corr(exam, corr_thres = 0.9, treat = FALSE)
# not verbose mode
exam_02 <- treatment_corr(exam, verbose = FALSE)
head(exam_02)
x2 x3 x6 x7
1 2.554297 0.1939687 c b
2 4.554297 2.1939687 d b
3 9.554297 4.1939687 f a
4 12.554297 6.1939687 g d
5 10.554297 3.1939687 f e
6 6.554297 10.1939687 h a
remove variables whose strong correlation
: x1, x4, x5 are removed.