Comparison of numerical variables of train set and test set

Compares summary statistics of the numerical variables in the train set and the test set contained in a "split_df" object.

compare_target_numeric(.data, ...)

Arguments

.data: an object of class "split_df", usually the result of a call to split_by().
...: one or more unquoted expressions separated by commas, selecting the numeric variables to compare. Variable names can be treated as if they were positions: positive values select variables, and negative values drop variables. If the first expression is negative, compare_target_numeric() automatically starts with all variables. If no expression is given, all numeric variables are compared. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.
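
The selection rules described for ... read like those of dplyr::select(). As a rough illustration, and assuming compare_target_numeric() resolves its selections the same way, the sketch below applies the same kinds of expressions to a small hypothetical data frame (df, its age column, and wanted are made up for this example):

```r
library(dplyr)

# Hypothetical data frame standing in for the numeric columns of a split_df
df <- data.frame(balance = c(1, 2), income = c(10, 20), age = c(30, 40))

# Positive selection keeps only the named variables
names(select(df, balance, income))
#> [1] "balance" "income"

# A leading negative selection starts from all variables and drops one
names(select(df, -income))
#> [1] "balance" "age"

# Splicing a list of symbols built from strings (rlang is installed with dplyr)
wanted <- rlang::syms(c("balance", "age"))
names(select(df, !!!wanted))
#> [1] "balance" "age"
```

If the assumption holds, the same expressions could be passed directly to compare_target_numeric() in place of select(df, ...).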

Value

A tbl_df with one row per numeric variable and the following columns:

variable : character. numeric variable name
train_mean : numeric. arithmetic mean of train set
test_mean : numeric. arithmetic mean of test set
train_sd : numeric. standard deviation of train set
test_sd : numeric. standard deviation of test set
train_z : numeric. the arithmetic mean of the train set divided by its standard deviation (train_mean / train_sd)
test_z : numeric. the arithmetic mean of the test set divided by its standard deviation (test_mean / test_sd)
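
To make the column definitions concrete, here is a minimal base R sketch that computes the same kind of table by hand. It is not the package's implementation: the train and test data frames are hypothetical, and summarise_numeric() is a helper introduced only for this illustration.

```r
# Hypothetical train and test sets; in practice they come from a split_df object
train <- data.frame(balance = c(700, 800, 900, 1000),
                    income  = c(30000, 32000, 34000, 36000))
test  <- data.frame(balance = c(750, 850, 950),
                    income  = c(31000, 33000, 35000))

# Illustrative helper: mean, standard deviation, and mean / sd of one vector
summarise_numeric <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(mean = m, sd = s, z = m / s)
}

vars <- names(train)
template <- c(mean = 0, sd = 0, z = 0)

# Each result is a 3 x length(vars) matrix with rows "mean", "sd", "z"
train_stats <- vapply(train[vars], summarise_numeric, template)
test_stats  <- vapply(test[vars],  summarise_numeric, template)

comparison <- data.frame(
  variable   = vars,
  train_mean = train_stats["mean", ],
  test_mean  = test_stats["mean", ],
  train_sd   = train_stats["sd", ],
  test_sd    = test_stats["sd", ],
  train_z    = train_stats["z", ],
  test_z     = test_stats["z", ],
  row.names  = NULL
)
comparison
```

Each *_z column is simply the corresponding mean divided by the corresponding standard deviation, so it can be checked directly against the other columns.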

Details

Compare the statistics of the numerical variables of the train set and the test set to check whether the raw data was split well, that is, whether each variable has a similar mean and standard deviation in both sets.
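
One possible way to read the result is to look at the relative gap between the train and test statistics. The sketch below uses the rounded values printed in the Examples section of this page; the mean_gap, sd_gap, and balanced columns and the 5% tolerance are assumptions made for illustration, not part of the function's output.

```r
# Rounded values copied from the printed example output for ISLR::Default
cmp <- data.frame(
  variable   = c("balance", "income"),
  train_mean = c(837, 33466),
  test_mean  = c(832, 33637),
  train_sd   = c(486, 13353),
  test_sd    = c(478, 13301)
)

# Relative gap between train and test, as a fraction of the train value
cmp$mean_gap <- abs(cmp$train_mean - cmp$test_mean) / abs(cmp$train_mean)
cmp$sd_gap   <- abs(cmp$train_sd - cmp$test_sd) / cmp$train_sd

# Hypothetical 5% tolerance; pick a threshold that suits your data
tolerance <- 0.05
cmp$balanced <- cmp$mean_gap < tolerance & cmp$sd_gap < tolerance

cmp[, c("variable", "mean_gap", "sd_gap", "balanced")]
```

With these values both variables fall well inside the tolerance, which is consistent with a split that preserves the distribution of balance and income.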

Examples

library(dplyr)
library(dlookr)  # provides split_by() and compare_target_numeric()

# Credit Card Default Data
head(ISLR::Default)
#>   default student   balance    income
#> 1      No      No  729.5265 44361.625
#> 2      No     Yes  817.1804 12106.135
#> 3      No      No 1073.5492 31767.139
#> 4      No      No  529.2506 35704.494
#> 5      No      No  785.6559 38463.496
#> 6      No     Yes  919.5885  7491.559

# Split the data into train and test sets, using default as the target variable
sb <- ISLR::Default %>%
  split_by(default)

# Compare all numeric variables
sb %>%
  compare_target_numeric()
#> # A tibble: 2 × 7
#>   variable train_mean test_mean train_sd test_sd train_z test_z
#>   <chr>         <dbl>     <dbl>    <dbl>   <dbl>   <dbl>  <dbl>
#> 1 balance        837.      832.     486.    478.    1.72   1.74
#> 2 income       33466.    33637.   13353.  13301.    2.51   2.53

# Compare only the selected variable
sb %>%
  compare_target_numeric(balance)
#> # A tibble: 1 × 7
#>   variable train_mean test_mean train_sd test_sd train_z test_z
#>   <chr>         <dbl>     <dbl>    <dbl>   <dbl>   <dbl>  <dbl>
#> 1 balance        837.      832.     486.    478.    1.72   1.74