binning_by() finds intervals for a numerical variable using optimal binning. Optimal binning categorizes a numeric characteristic into bins for later use in scoring modeling.
binning_by(.data, y, x, p = 0.05, ordered = TRUE, labels = NULL)
Argument | Description |
---|---|
.data | a data frame. |
y | character. name of the binary response variable (0, 1). The variable must contain only the integers 0 and 1. A factor with two levels is also accepted; it is converted to 0 and 1 during the calculation. |
x | character. name of the continuous characteristic variable. It must have at least 5 different values, and Inf is not allowed. |
p | numeric. percentage of records per bin. Default 5% (0.05). This parameter only accepts values greater than 0.00 (0%) and lower than 0.50 (50%). |
ordered | logical. whether to build an ordered factor or not. |
labels | character. the label names to use for each of the bins. |
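
The constraints on y, x, and p described above can be checked before calling binning_by(). The following is a minimal sketch in base R; `check_binning_inputs()` is a hypothetical helper written for illustration, not a function of dlookr, and binning_by() may validate its inputs differently.

```r
# Hypothetical helper (not part of dlookr): returns a character vector of
# problems found; an empty vector means the documented constraints are met.
check_binning_inputs <- function(y, x, p = 0.05) {
  problems <- character(0)

  # y: integers 0 and 1 only, or a factor with exactly two levels
  if (is.factor(y)) {
    if (nlevels(y) != 2) {
      problems <- c(problems, "factor y must have exactly two levels")
    }
  } else if (!all(y[!is.na(y)] %in% c(0, 1))) {
    problems <- c(problems, "y must contain only 0 and 1")
  }

  # x: numeric, no Inf, at least 5 different values
  if (!is.numeric(x)) {
    problems <- c(problems, "x must be numeric")
  } else {
    if (any(is.infinite(x))) {
      problems <- c(problems, "x must not contain Inf")
    }
    if (length(unique(x[!is.na(x)])) < 5) {
      problems <- c(problems, "x needs at least 5 different values")
    }
  }

  # p: strictly between 0 and 0.5
  if (!is.numeric(p) || length(p) != 1 || p <= 0 || p >= 0.5) {
    problems <- c(problems, "p must be greater than 0 and lower than 0.5")
  }

  problems
}

# Valid inputs give an empty result
check_binning_inputs(y = c(0, 1, 0, 1, 1, 0), x = c(1, 2, 3, 4, 5, 6))

# Invalid inputs list every violated constraint
check_binning_inputs(y = c(0, 1, 2), x = c(1, 2, Inf), p = 0.6)
```

Collecting all problems in one vector, rather than stopping at the first, makes it easier to fix several input issues in one pass.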
an object of "optimal_bins" class. Attributes of the "optimal_bins" class are as follows.
class : "optimal_bins".
type : binning type, "optimal".
breaks : numeric. the number of intervals into which x is to be cut.
levels : character. levels of binned value.
raw : numeric. raw data, x argument value.
ivtable : data.frame. information value table.
iv : numeric. information value.
target : integer. binary response variable.
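
To illustrate how breaks, levels, and the ordered and labels arguments relate, the sketch below uses base R's cut() with the break points that appear in the example output (0.5, 0.9, 1.8, 9.4). The sample vector x and the custom label names are made up for illustration, and whether binning_by() calls cut() internally is an assumption, not something this page states.

```r
# Illustrative values only; the breaks are copied from the example output
x <- c(0.5, 0.8, 0.9, 1.1, 1.8, 2.5, 9.4, NA)
breaks <- c(0.5, 0.9, 1.8, 9.4)

# Ordered factor with interval labels, comparable to ordered = TRUE
binned <- cut(x, breaks = breaks, include.lowest = TRUE, ordered_result = TRUE)
levels(binned)
is.ordered(binned)

# Missing values stay NA and are counted separately
table(binned, useNA = "ifany")

# Custom labels, comparable to supplying the labels argument
# ("low", "mid", "high" are hypothetical names)
labelled <- cut(x, breaks = breaks, include.lowest = TRUE,
                labels = c("low", "mid", "high"))
levels(labelled)
```

With include.lowest = TRUE the first interval is closed on the left, which is why the lowest level is written as [0.5,0.9] rather than (0.5,0.9].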
This function is useful when used with the mutate/transmute functions of the dplyr package. It is implemented using the smbinning() function of the smbinning package.
Attributes of the "optimal_bins" class are as follows.
class : "optimal_bins".
levels : character. factor or ordered factor levels
type : character. binning method
breaks : numeric. breaks for binning
raw : numeric. the raw data before binning
ivtable : data.frame. information value table
iv : numeric. information value
target : integer. binary response variable
See vignette("transformation") for an introduction to these concepts.
# \donttest{
library(dplyr)

# Generate data for the example
heartfailure2 <- heartfailure
heartfailure2[sample(seq(NROW(heartfailure2)), 5), "creatinine"] <- NA

# optimal binning using character
bin <- binning_by(heartfailure2, "death_event", "creatinine")
#> Warning: The factor y has been changed to a numeric vector consisting of 0 and 1.
#> 'Yes' changed to 1 (positive) and 'No' changed to 0 (negative).

# optimal binning using name
bin <- binning_by(heartfailure2, death_event, creatinine)
#> Warning: The factor y has been changed to a numeric vector consisting of 0 and 1.
#> 'Yes' changed to 1 (positive) and 'No' changed to 0 (negative).

bin
#> binned type: optimal
#> number of bins: 3
#> x
#> [0.5,0.9] (0.9,1.8] (1.8,9.4]      <NA>
#>        80       167        47         5

#>         Bin CntRec CntPos CntNeg CntCumPos CntCumNeg RatePos RateNeg RateCumPos
#> 1 [0.5,0.9]     80      9     71         9        71 0.09375 0.34975    0.09375
#> 2 (0.9,1.8]    167     52    115        61       186 0.54167 0.56650    0.63542
#> 3 (1.8,9.4]     47     34     13        95       199 0.35417 0.06404    0.98958
#> 4      <NA>      5      1      4        96       203 0.01042 0.01970    1.00000
#> 5     Total    299     96    203        NA        NA 1.00000 1.00000         NA
#>   RateCumNeg    Odds   LnOdds      WoE      IV     JSD     AUC
#> 1    0.34975 0.12676 -2.06546 -1.31660 0.33705 0.03933 0.01639
#> 2    0.91626 0.45217 -0.79369 -0.04483 0.00111 0.00014 0.20654
#> 3    0.98030 2.61538  0.96141  1.71027 0.49620 0.05542 0.05203
#> 4    1.00000 0.25000 -1.38629 -0.63744 0.00592 0.00073 0.01960
#> 5         NA 0.47291 -0.74886       NA 0.84028 0.09562 0.29457

#> ── Binning Table ──────────────────────── Several Metrics ──
#>         Bin CntRec CntPos CntNeg RatePos RateNeg    Odds      WoE      IV
#> 1 [0.5,0.9]     80      9     71 0.09375 0.34975 0.12676 -1.31660 0.33705
#> 2 (0.9,1.8]    167     52    115 0.54167 0.56650 0.45217 -0.04483 0.00111
#> 3 (1.8,9.4]     47     34     13 0.35417 0.06404 2.61538  1.71027 0.49620
#> 4      <NA>      5      1      4 0.01042 0.01970 0.25000 -0.63744 0.00592
#> 5     Total    299     96    203 1.00000 1.00000 0.47291       NA 0.84028
#>       JSD     AUC
#> 1 0.03933 0.01639
#> 2 0.00014 0.20654
#> 3 0.05542 0.05203
#> 4 0.00073 0.01960
#> 5 0.09562 0.29457
#>
#> ── General Metrics ─────────────────────────────────────────
#> • Gini index                       : -0.41087
#> • IV (Jeffrey)                     : 0.84028
#> • JS (Jensen-Shannon) Divergence   : 0.09562
#> • Kolmogorov-Smirnov Statistics    : 0.28084
#> • HHI (Herfindahl-Hirschman Index) : 0.40853
#> • HHI (normalized)                 : 0.21137
#> • Cramer's V                       : 0.41553
#>
#> ── Significance Tests ──────────────────── Chisquare Test ──
#>       Bin A     Bin B statistics      p_value
#> 1 [0.5,0.9] (0.9,1.8]   11.50352 6.946445e-04
#> 2 (0.9,1.8] (1.8,9.4]   25.90425 3.587780e-07
#>

#>  [1] (1.8,9.4] (0.9,1.8] (0.9,1.8] (1.8,9.4] (1.8,9.4] (1.8,9.4] (0.9,1.8]
#>  [8] (0.9,1.8] (0.9,1.8] (1.8,9.4] (1.8,9.4] [0.5,0.9] (0.9,1.8] (0.9,1.8]
#> [15] (0.9,1.8] (0.9,1.8] [0.5,0.9] [0.5,0.9] (0.9,1.8] (1.8,9.4]
#> Levels: [0.5,0.9] < (0.9,1.8] < (1.8,9.4]
# }
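
The WoE and IV columns of the binning table in the example output, and the Kolmogorov-Smirnov statistic, can be recomputed from the CntPos and CntNeg columns alone. The sketch below uses the standard weight-of-evidence and information-value definitions in base R; the counts are copied from the output above, and the variable names are chosen for illustration. It is not necessarily how dlookr or smbinning computes these values internally.

```r
# Counts per bin copied from the example output; the last entry is the <NA> bin
cnt_pos <- c(9, 52, 34, 1)    # CntPos
cnt_neg <- c(71, 115, 13, 4)  # CntNeg

# Share of all positives / negatives that fall in each bin (RatePos, RateNeg)
rate_pos <- cnt_pos / sum(cnt_pos)
rate_neg <- cnt_neg / sum(cnt_neg)

# Weight of evidence per bin: log ratio of the two shares
woe <- log(rate_pos / rate_neg)

# Information value: per-bin contribution and total
iv_bin <- (rate_pos - rate_neg) * woe
iv_total <- sum(iv_bin)

# Kolmogorov-Smirnov statistic: largest gap between the cumulative shares
ks <- max(abs(cumsum(rate_pos) - cumsum(rate_neg)))

round(woe, 5)
round(iv_bin, 5)
round(iv_total, 5)
round(ks, 5)
```

The rounded results should agree with the WoE, IV, and Kolmogorov-Smirnov values printed in the example. A positive WoE, as in the (1.8,9.4] bin, means the bin holds a larger share of positives than of negatives.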