Compute Predictive Power Score

The pps() compute PPS(Predictive Power Score) for exploratory data analysis.

pps(.data, ...)

# S3 method for data.frame
pps(.data, ..., cv_folds = 5, do_parallel = FALSE, n_cores = -1)

# S3 method for target_df
pps(.data, ..., cv_folds = 5, do_parallel = FALSE, n_cores = -1)

Arguments

.data: a target_df or data.frame.
...: one or more unquoted expressions separated by commas. You can treat variable names like they are positions. Positive values select variables; negative values to drop variables. If the first expression is negative, describe() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.
cv_folds: integer. number of cross-validation folds.
do_parallel: logical. whether to perform score calls in parallel.
n_cores: integer. number of cores to use, defaults to maximum cores - 1.

Value

An object of the class as pps. Attributes of pps class is as follows.

type : type of pps
target : name of target variable
predictor : name of predictor

Details

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two variables. The score ranges from 0 (no predictive power) to 1 (perfect predictive power).

Information of Predictive Power Score

The information of PPS is as follows.

x : the name of the predictor variable
y : the name of the target variable
result_type : text showing how to interpret the resulting score
pps : the predictive power score
metric : the evaluation metric used to compute the PPS
baseline_score : the score of a naive model on the evaluation metric
model_score : the score of the predictive model on the evaluation metric
cv_folds : how many cross-validation folds were used
seed : the seed that was set
algorithm : text shwoing what algorithm was used
model_type : text showing whether classification or regression was used

References

RIP correlation. Introducing the Predictive Power Score - by Florian Wetschoreck
- https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598

Examples

library(dplyr)

# If you want to use this feature, you need to install the 'ppsr' package.
if (!requireNamespace("ppsr", quietly = TRUE)) {
  cat("If you want to use this feature, you need to install the 'ppsr' package.\n")
}

# pps type is generic =======================================
pps_generic <- pps(iris)
pps_generic
#>               x            y                       result_type        pps
#> 1  Sepal.Length Sepal.Length predictor and target are the same 1.00000000
#> 2   Sepal.Width Sepal.Length            predictive power score 0.04632352
#> 3  Petal.Length Sepal.Length            predictive power score 0.54913985
#> 4   Petal.Width Sepal.Length            predictive power score 0.41276679
#> 5       Species Sepal.Length            predictive power score 0.40754872
#> 6  Sepal.Length  Sepal.Width            predictive power score 0.06790301
#> 7   Sepal.Width  Sepal.Width predictor and target are the same 1.00000000
#> 8  Petal.Length  Sepal.Width            predictive power score 0.23769911
#> 9   Petal.Width  Sepal.Width            predictive power score 0.21746588
#> 10      Species  Sepal.Width            predictive power score 0.20128762
#> 11 Sepal.Length Petal.Length            predictive power score 0.61608360
#> 12  Sepal.Width Petal.Length            predictive power score 0.24263851
#> 13 Petal.Length Petal.Length predictor and target are the same 1.00000000
#> 14  Petal.Width Petal.Length            predictive power score 0.79175121
#> 15      Species Petal.Length            predictive power score 0.79049070
#> 16 Sepal.Length  Petal.Width            predictive power score 0.48735314
#> 17  Sepal.Width  Petal.Width            predictive power score 0.20124105
#> 18 Petal.Length  Petal.Width            predictive power score 0.74378445
#> 19  Petal.Width  Petal.Width predictor and target are the same 1.00000000
#> 20      Species  Petal.Width            predictive power score 0.75611126
#> 21 Sepal.Length      Species            predictive power score 0.55918638
#> 22  Sepal.Width      Species            predictive power score 0.31344008
#> 23 Petal.Length      Species            predictive power score 0.91675800
#> 24  Petal.Width      Species            predictive power score 0.93985320
#> 25      Species      Species predictor and target are the same 1.00000000
#>         metric baseline_score model_score cv_folds seed algorithm
#> 1         <NA>             NA          NA       NA   NA      <NA>
#> 2          MAE      0.6893222   0.6620058        5    1      tree
#> 3          MAE      0.6893222   0.3100867        5    1      tree
#> 4          MAE      0.6893222   0.4040123        5    1      tree
#> 5          MAE      0.6893222   0.4076661        5    1      tree
#> 6          MAE      0.3372222   0.3184796        5    1      tree
#> 7         <NA>             NA          NA       NA   NA      <NA>
#> 8          MAE      0.3372222   0.2564258        5    1      tree
#> 9          MAE      0.3372222   0.2631636        5    1      tree
#> 10         MAE      0.3372222   0.2677963        5    1      tree
#> 11         MAE      1.5719667   0.5971445        5    1      tree
#> 12         MAE      1.5719667   1.1945031        5    1      tree
#> 13        <NA>             NA          NA       NA   NA      <NA>
#> 14         MAE      1.5719667   0.3265152        5    1      tree
#> 15         MAE      1.5719667   0.3280552        5    1      tree
#> 16         MAE      0.6623556   0.3377682        5    1      tree
#> 17         MAE      0.6623556   0.5315834        5    1      tree
#> 18         MAE      0.6623556   0.1684906        5    1      tree
#> 19        <NA>             NA          NA       NA   NA      <NA>
#> 20         MAE      0.6623556   0.1608119        5    1      tree
#> 21 F1_weighted      0.3176487   0.7028029        5    1      tree
#> 22 F1_weighted      0.3176487   0.5377587        5    1      tree
#> 23 F1_weighted      0.3176487   0.9404972        5    1      tree
#> 24 F1_weighted      0.3176487   0.9599148        5    1      tree
#> 25        <NA>             NA          NA       NA   NA      <NA>
#>        model_type
#> 1            <NA>
#> 2      regression
#> 3      regression
#> 4      regression
#> 5      regression
#> 6      regression
#> 7            <NA>
#> 8      regression
#> 9      regression
#> 10     regression
#> 11     regression
#> 12     regression
#> 13           <NA>
#> 14     regression
#> 15     regression
#> 16     regression
#> 17     regression
#> 18     regression
#> 19           <NA>
#> 20     regression
#> 21 classification
#> 22 classification
#> 23 classification
#> 24 classification
#> 25           <NA>

# pps type is target_by =====================================
##-----------------------------------------------------------
# If the target variable is a categorical variable
categ <- target_by(iris, Species)

# compute all variables
pps_cat <- pps(categ)
pps_cat
#>              x       y                       result_type       pps      metric
#> 1 Sepal.Length Species            predictive power score 0.5591864 F1_weighted
#> 2  Sepal.Width Species            predictive power score 0.3134401 F1_weighted
#> 3 Petal.Length Species            predictive power score 0.9167580 F1_weighted
#> 4  Petal.Width Species            predictive power score 0.9398532 F1_weighted
#> 5      Species Species predictor and target are the same 1.0000000        <NA>
#>   baseline_score model_score cv_folds seed algorithm     model_type
#> 1      0.3176487   0.7028029        5    1      tree classification
#> 2      0.3176487   0.5377587        5    1      tree classification
#> 3      0.3176487   0.9404972        5    1      tree classification
#> 4      0.3176487   0.9599148        5    1      tree classification
#> 5             NA          NA       NA   NA      <NA>           <NA>

# compute Petal.Length and Petal.Width variable
pps_cat <- pps(categ, Petal.Length, Petal.Width)
pps_cat
#>              x       y                       result_type       pps      metric
#> 1 Petal.Length Species            predictive power score 0.9167580 F1_weighted
#> 2  Petal.Width Species            predictive power score 0.9398532 F1_weighted
#> 3      Species Species predictor and target are the same 1.0000000        <NA>
#>   baseline_score model_score cv_folds seed algorithm     model_type
#> 1      0.3176487   0.9404972        5    1      tree classification
#> 2      0.3176487   0.9599148        5    1      tree classification
#> 3             NA          NA       NA   NA      <NA>           <NA>

# Using dplyr
pps_cat <- iris %>% 
  target_by(Species) %>% 
  pps()

pps_cat
#>              x       y                       result_type       pps      metric
#> 1 Sepal.Length Species            predictive power score 0.5591864 F1_weighted
#> 2  Sepal.Width Species            predictive power score 0.3134401 F1_weighted
#> 3 Petal.Length Species            predictive power score 0.9167580 F1_weighted
#> 4  Petal.Width Species            predictive power score 0.9398532 F1_weighted
#> 5      Species Species predictor and target are the same 1.0000000        <NA>
#>   baseline_score model_score cv_folds seed algorithm     model_type
#> 1      0.3176487   0.7028029        5    1      tree classification
#> 2      0.3176487   0.5377587        5    1      tree classification
#> 3      0.3176487   0.9404972        5    1      tree classification
#> 4      0.3176487   0.9599148        5    1      tree classification
#> 5             NA          NA       NA   NA      <NA>           <NA>

##-----------------------------------------------------------
# If the target variable is a numerical variable
num <- target_by(iris, Petal.Length)

pps_num <- pps(num)
pps_num
#>              x            y                       result_type       pps metric
#> 1 Sepal.Length Petal.Length            predictive power score 0.6160836    MAE
#> 2  Sepal.Width Petal.Length            predictive power score 0.2426385    MAE
#> 3 Petal.Length Petal.Length predictor and target are the same 1.0000000   <NA>
#> 4  Petal.Width Petal.Length            predictive power score 0.7917512    MAE
#> 5      Species Petal.Length            predictive power score 0.7904907    MAE
#>   baseline_score model_score cv_folds seed algorithm model_type
#> 1       1.571967   0.5971445        5    1      tree regression
#> 2       1.571967   1.1945031        5    1      tree regression
#> 3             NA          NA       NA   NA      <NA>       <NA>
#> 4       1.571967   0.3265152        5    1      tree regression
#> 5       1.571967   0.3280552        5    1      tree regression