binning_by() finds intervals for a numerical variable using optimal binning. Optimal binning categorizes a numeric characteristic into bins for later use in scoring modeling.

binning_by(.data, y, x, p = 0.05, ordered = TRUE, labels = NULL)

Arguments

.data

a data frame.

y

character. name of the binary response variable (0, 1). The variable must contain only the integers 0 and 1. If it is a factor with two levels, it is converted to 0 and 1 during the calculation, and a warning reports the mapping.

x

character. name of the continuous characteristic variable. It must have at least 5 distinct values, and Inf is not allowed.

p

numeric. percentage of records per bin. Default 5% (0.05). This parameter only accepts values greater than 0.00 (0%) and lower than 0.50 (50%).

ordered

logical. whether to build an ordered factor or not.

labels

character. the label names to use for each of the bins.
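
The constraints on y, x and p described above can be sketched as a small pre-check in base R. check_binning_inputs() is a hypothetical helper written for this illustration and is not part of the package; the rule that the second factor level becomes 1 is an assumption based on the warning shown in the Examples ('Yes' changed to 1, 'No' changed to 0).

```r
# Hypothetical helper, NOT part of the package: it only mirrors the
# documented constraints on y, x and p.
check_binning_inputs <- function(data, y, x, p = 0.05) {
  stopifnot(is.data.frame(data), y %in% names(data), x %in% names(data))

  yv <- data[[y]]
  if (is.factor(yv)) {
    if (nlevels(yv) != 2) stop("factor y must have exactly two levels")
    # Assumption: the second level is treated as the positive class (1).
    yv <- as.integer(yv) - 1L
  }
  if (!all(yv[!is.na(yv)] %in% c(0, 1))) stop("y must contain only 0 and 1")

  xv <- data[[x]]
  if (!is.numeric(xv)) stop("x must be numeric")
  if (any(is.infinite(xv))) stop("x must not contain Inf")
  if (length(unique(xv[!is.na(xv)])) < 5) {
    stop("x needs at least 5 distinct values")
  }

  if (!is.numeric(p) || length(p) != 1 || p <= 0 || p >= 0.5) {
    stop("p must be greater than 0 and lower than 0.5")
  }

  # Return the 0/1 version of y so the caller can reuse it.
  invisible(yv)
}

# Made-up toy data, for illustration only.
toy <- data.frame(
  event = factor(c("No", "Yes", "No", "Yes", "No", "No")),
  value = c(0.5, 0.9, 1.1, 1.8, 2.4, 9.4)
)
y01 <- check_binning_inputs(toy, "event", "value", p = 0.05)
y01
```

Running such a check before calling binning_by() gives clearer error messages for invalid input, but the function's own validation remains authoritative.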

Value

an object of "optimal_bins" class. The attributes of the "optimal_bins" class are as follows.

  • class : "optimal_bins".

  • type : binning type, "optimal".

  • breaks : numeric. the number of intervals into which x is to be cut.

  • levels : character. levels of binned value.

  • raw : numeric. raw data, x argument value.

  • ivtable : data.frame. information value table.

  • iv : numeric. information value.

  • target : integer. binary response variable.
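
To illustrate how breaks, levels and the ordered factor relate, here is a base-R sketch using cut(). It assumes the binned values behave like the result of cut() with closed lowest interval; it is not the package's implementation. The cut points are taken from the printed output in the Examples, and the data vector is made up.

```r
# Cut points taken from the example output on this page;
# the data vector is invented for illustration.
breaks <- c(0.5, 0.9, 1.8, 9.4)
x <- c(0.5, 0.9, 1.0, 1.8, 2.5, 9.4, NA)

# ordered_result = TRUE plays the role of the `ordered` argument;
# include.lowest = TRUE closes the first interval on the left.
binned <- cut(x, breaks = breaks, include.lowest = TRUE, ordered_result = TRUE)

levels(binned)                  # interval labels, one per bin
table(binned, useNA = "ifany")  # records per bin, missing values kept
```

Missing values in x stay NA after binning, which is consistent with the separate <NA> row in the example output.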

Details

This function is useful in combination with the mutate() and transmute() functions of the dplyr package. It is implemented using the smbinning() function of the smbinning package.

attributes of "optimal_bins" class

The attributes of the "optimal_bins" class are as follows.

  • class : "optimal_bins".

  • levels : character. factor or ordered factor levels

  • type : character. binning method

  • breaks : numeric. breaks for binning

  • raw : numeric. raw data before binning

  • ivtable : data.frame. information value table

  • iv : numeric. information value

  • target : integer. binary response variable

See vignette("transformation") for an introduction to these concepts.
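
As a worked example of the information value, the WoE and IV columns can be recomputed in base R from the positive and negative counts per bin. The sketch below uses the counts printed in the Examples output on this page and the usual weight-of-evidence definition, WoE = log(RatePos / RateNeg) and IV = (RatePos - RateNeg) * WoE. The column names are made up for the illustration, and the code is a check on the printed numbers rather than the package's own code.

```r
# Counts per bin copied from the example performance table
# (including the <NA> bin); column names are illustrative.
perf <- data.frame(
  bin = c("[0.5,0.9]", "(0.9,1.8]", "(1.8,9.4]", "<NA>"),
  pos = c(9, 52, 34, 1),
  neg = c(71, 115, 13, 4)
)

# Share of all positives / negatives that fall into each bin.
perf$rate_pos <- perf$pos / sum(perf$pos)
perf$rate_neg <- perf$neg / sum(perf$neg)

# Weight of evidence and per-bin contribution to the information value.
perf$woe <- log(perf$rate_pos / perf$rate_neg)
perf$iv <- (perf$rate_pos - perf$rate_neg) * perf$woe

# Total information value is the sum over bins.
total_iv <- sum(perf$iv)

round(perf[, c("woe", "iv")], 5)
round(total_iv, 5)
```

The rounded values should agree with the WoE and IV columns and with the Total row of the printed table.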

Examples

library(dlookr)
library(dplyr)

# Generate data for the example
heartfailure2 <- heartfailure
heartfailure2[sample(seq(NROW(heartfailure2)), 5), "creatinine"] <- NA

# optimal binning using character
bin <- binning_by(heartfailure2, "death_event", "creatinine")
#> Warning: The factor y has been changed to a numeric vector consisting of 0 and 1.
#> 'Yes' changed to 1 (positive) and 'No' changed to 0 (negative).

# optimal binning using name
bin <- binning_by(heartfailure2, death_event, creatinine)
#> Warning: The factor y has been changed to a numeric vector consisting of 0 and 1.
#> 'Yes' changed to 1 (positive) and 'No' changed to 0 (negative).

bin
#> binned type: optimal
#> number of bins: 3
#> x
#> [0.5,0.9] (0.9,1.8] (1.8,9.4]      <NA>
#>        80       167        47         5

# performance table
attr(bin, "performance")
#>         Bin CntRec CntPos CntNeg CntCumPos CntCumNeg RatePos RateNeg RateCumPos
#> 1 [0.5,0.9]     80      9     71         9        71 0.09375 0.34975    0.09375
#> 2 (0.9,1.8]    167     52    115        61       186 0.54167 0.56650    0.63542
#> 3 (1.8,9.4]     47     34     13        95       199 0.35417 0.06404    0.98958
#> 4      <NA>      5      1      4        96       203 0.01042 0.01970    1.00000
#> 5     Total    299     96    203        NA        NA 1.00000 1.00000         NA
#>   RateCumNeg    Odds   LnOdds      WoE      IV     JSD     AUC
#> 1    0.34975 0.12676 -2.06546 -1.31660 0.33705 0.03933 0.01639
#> 2    0.91626 0.45217 -0.79369 -0.04483 0.00111 0.00014 0.20654
#> 3    0.98030 2.61538  0.96141  1.71027 0.49620 0.05542 0.05203
#> 4    1.00000 0.25000 -1.38629 -0.63744 0.00592 0.00073 0.01960
#> 5         NA 0.47291 -0.74886       NA 0.84028 0.09562 0.29457

# summary optimal_bins class
summary(bin)
#> ── Binning Table ──────────────────────── Several Metrics ──
#>         Bin CntRec CntPos CntNeg RatePos RateNeg    Odds      WoE      IV
#> 1 [0.5,0.9]     80      9     71 0.09375 0.34975 0.12676 -1.31660 0.33705
#> 2 (0.9,1.8]    167     52    115 0.54167 0.56650 0.45217 -0.04483 0.00111
#> 3 (1.8,9.4]     47     34     13 0.35417 0.06404 2.61538  1.71027 0.49620
#> 4      <NA>      5      1      4 0.01042 0.01970 0.25000 -0.63744 0.00592
#> 5     Total    299     96    203 1.00000 1.00000 0.47291       NA 0.84028
#>       JSD     AUC
#> 1 0.03933 0.01639
#> 2 0.00014 0.20654
#> 3 0.05542 0.05203
#> 4 0.00073 0.01960
#> 5 0.09562 0.29457
#>
#> ── General Metrics ─────────────────────────────────────────
#> • Gini index                       : -0.41087
#> • IV (Jeffrey)                     : 0.84028
#> • JS (Jensen-Shannon) Divergence   : 0.09562
#> • Kolmogorov-Smirnov Statistics    : 0.28084
#> • HHI (Herfindahl-Hirschman Index) : 0.40853
#> • HHI (normalized)                 : 0.21137
#> • Cramer's V                       : 0.41553
#>
#> ── Significance Tests ──────────────────── Chisquare Test ──
#>       Bin A     Bin B statistics      p_value
#> 1 [0.5,0.9] (0.9,1.8]   11.50352 6.946445e-04
#> 2 (0.9,1.8] (1.8,9.4]   25.90425 3.587780e-07
#>

# visualize all information for optimal_bins class
plot(bin)

# visualize WoE information for optimal_bins class
plot(bin, type = "WoE")

# visualize all information without typographic
plot(bin, typographic = FALSE)

# extract binned results
extract(bin) %>% head(20)
#>  [1] (1.8,9.4] (0.9,1.8] (0.9,1.8] (1.8,9.4] (1.8,9.4] (1.8,9.4] (0.9,1.8]
#>  [8] (0.9,1.8] (0.9,1.8] (1.8,9.4] (1.8,9.4] [0.5,0.9] (0.9,1.8] (0.9,1.8]
#> [15] (0.9,1.8] (0.9,1.8] [0.5,0.9] [0.5,0.9] (0.9,1.8] (1.8,9.4]
#> Levels: [0.5,0.9] < (0.9,1.8] < (1.8,9.4]
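
Two of the figures reported by summary(bin) above can be cross-checked in base R from the bin counts alone: the chi-square statistics for adjacent bins and the Kolmogorov-Smirnov statistic. This is a sketch under two assumptions inferred from the printed numbers, not from the package source: the chi-square test is run without Yates' continuity correction, and the KS statistic is the largest absolute gap between the cumulative positive and negative rates, with the <NA> bin counted in the totals. adjacent_chisq() is a hypothetical helper name.

```r
# Counts copied from the example output; the last entry is the <NA> bin.
pos_all <- c(9, 52, 34, 1)
neg_all <- c(71, 115, 13, 4)

# Hypothetical helper: chi-square test on the 2x2 table formed by
# bin i and bin i + 1 (rows = bins, columns = positive / negative).
adjacent_chisq <- function(i) {
  m <- rbind(
    c(pos_all[i], neg_all[i]),
    c(pos_all[i + 1], neg_all[i + 1])
  )
  # Assumption: no continuity correction (correct = FALSE).
  test <- chisq.test(m, correct = FALSE)
  c(statistic = unname(test$statistic), p_value = test$p.value)
}

chisq_12 <- adjacent_chisq(1)  # [0.5,0.9] vs (0.9,1.8]
chisq_23 <- adjacent_chisq(2)  # (0.9,1.8] vs (1.8,9.4]

# KS statistic: maximum absolute difference between the cumulative rates.
cum_pos <- cumsum(pos_all) / sum(pos_all)
cum_neg <- cumsum(neg_all) / sum(neg_all)
ks <- max(abs(cum_pos - cum_neg))

round(c(chisq_12[["statistic"]], chisq_23[["statistic"]]), 5)
round(ks, 5)
```

Small p-values for adjacent bins indicate that the event rates of neighbouring bins differ, which is what a useful binning is expected to show.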