pps() computes the Predictive Power Score (PPS) for exploratory data analysis.

pps(.data, ...)

# S3 method for data.frame
pps(.data, ..., cv_folds = 5, do_parallel = FALSE, n_cores = -1)

# S3 method for target_df
pps(.data, ..., cv_folds = 5, do_parallel = FALSE, n_cores = -1)

Arguments

.data

a target_df or data.frame.

...

one or more unquoted expressions separated by commas. You can treat variable names as if they were positions. Positive values select variables; negative values drop variables. If the first expression is negative, pps() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.

cv_folds

integer. number of cross-validation folds.

do_parallel

logical. whether to perform score calls in parallel.

n_cores

integer. number of cores to use. The default, -1, uses the maximum number of cores minus 1.

Value

An object of class pps. The attributes of the pps class are as follows.

  • type : type of pps

  • target : name of target variable

  • predictor : name of predictor

Details

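The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two variables. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). Asymmetric means that the score of x predicting y generally differs from the score of y predicting x: in the iris output under Examples, Petal.Length predicts Species with a score of 0.916, whereas Species predicts Petal.Length with a score of 0.794.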
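
The normalization behind the score can be written out explicitly. The sketch below follows the definition given in the article listed under References and assumes that this implementation uses the same scheme. MAE and F1 correspond to the metric values MAE and F1_weighted reported in the output; the subscripts "model" and "naive" are notation introduced here for illustration and correspond to the model_score and baseline_score columns.

```latex
\[
\mathrm{PPS}_{\text{regression}}
  = \max\!\left(0,\; 1 - \frac{\mathrm{MAE}_{\text{model}}}{\mathrm{MAE}_{\text{naive}}}\right)
\]
\[
\mathrm{PPS}_{\text{classification}}
  = \max\!\left(0,\; \frac{F1_{\text{model}} - F1_{\text{naive}}}{1 - F1_{\text{naive}}}\right)
\]
```

In both cases the predictive model is compared with a naive baseline that ignores the predictor, so a model that does no better than the baseline scores 0 and a perfect model scores 1. The pps values printed under Examples do not exactly equal these expressions applied to the reported baseline_score and model_score. That would be consistent with the score being computed per cross-validation fold and then averaged, but that reading is an assumption rather than something this page states.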

Information of Predictive Power Score

The PPS result contains the following information.

  • x : the name of the predictor variable

  • y : the name of the target variable

  • result_type : text showing how to interpret the resulting score

  • pps : the predictive power score

  • metric : the evaluation metric used to compute the PPS

  • baseline_score : the score of a naive model on the evaluation metric

  • model_score : the score of the predictive model on the evaluation metric

  • cv_folds : how many cross-validation folds were used

  • seed : the seed that was set

  • algorithm : text showing what algorithm was used

  • model_type : text showing whether classification or regression was used

References

  • RIP correlation. Introducing the Predictive Power Score - by Florian Wetschoreck

    • https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598

Examples

# \donttest{
library(dlookr)  # provides pps() and target_by()
library(dplyr)

# pps type is generic =======================================
pps_generic <- pps(iris)
pps_generic
#>               x            y                       result_type        pps
#> 1  Sepal.Length Sepal.Length predictor and target are the same 1.00000000
#> 2   Sepal.Width Sepal.Length            predictive power score 0.05700280
#> 3  Petal.Length Sepal.Length            predictive power score 0.52848518
#> 4   Petal.Width Sepal.Length            predictive power score 0.43360037
#> 5       Species Sepal.Length            predictive power score 0.40586730
#> 6  Sepal.Length  Sepal.Width            predictive power score 0.09404167
#> 7   Sepal.Width  Sepal.Width predictor and target are the same 1.00000000
#> 8  Petal.Length  Sepal.Width            predictive power score 0.25083405
#> 9   Petal.Width  Sepal.Width            predictive power score 0.24666074
#> 10      Species  Sepal.Width            predictive power score 0.21699768
#> 11 Sepal.Length Petal.Length            predictive power score 0.61917579
#> 12  Sepal.Width Petal.Length            predictive power score 0.17961947
#> 13 Petal.Length Petal.Length predictor and target are the same 1.00000000
#> 14  Petal.Width Petal.Length            predictive power score 0.78151887
#> 15      Species Petal.Length            predictive power score 0.79350708
#> 16 Sepal.Length  Petal.Width            predictive power score 0.48754118
#> 17  Sepal.Width  Petal.Width            predictive power score 0.14674017
#> 18 Petal.Length  Petal.Width            predictive power score 0.74123250
#> 19  Petal.Width  Petal.Width predictor and target are the same 1.00000000
#> 20      Species  Petal.Width            predictive power score 0.75318927
#> 21 Sepal.Length      Species            predictive power score 0.60444796
#> 22  Sepal.Width      Species            predictive power score 0.36017909
#> 23 Petal.Length      Species            predictive power score 0.91595246
#> 24  Petal.Width      Species            predictive power score 0.93554935
#> 25      Species      Species predictor and target are the same 1.00000000
#>         metric baseline_score model_score cv_folds seed algorithm
#> 1         <NA>             NA          NA       NA   NA      <NA>
#> 2          MAE      0.6874444   0.6625728        5    1      tree
#> 3          MAE      0.6874444   0.3219946        5    1      tree
#> 4          MAE      0.6874444   0.3869010        5    1      tree
#> 5          MAE      0.6874444   0.4069932        5    1      tree
#> 6          MAE      0.3398667   0.3117519        5    1      tree
#> 7         <NA>             NA          NA       NA   NA      <NA>
#> 8          MAE      0.3398667   0.2549879        5    1      tree
#> 9          MAE      0.3398667   0.2553413        5    1      tree
#> 10         MAE      0.3398667   0.2671294        5    1      tree
#> 11         MAE      1.5653222   0.5901695        5    1      tree
#> 12         MAE      1.5653222   1.2810047        5    1      tree
#> 13        <NA>             NA          NA       NA   NA      <NA>
#> 14         MAE      1.5653222   0.3352507        5    1      tree
#> 15         MAE      1.5653222   0.3202118        5    1      tree
#> 16         MAE      0.6587222   0.3298965        5    1      tree
#> 17         MAE      0.6587222   0.5623788        5    1      tree
#> 18         MAE      0.6587222   0.1677181        5    1      tree
#> 19        <NA>             NA          NA       NA   NA      <NA>
#> 20         MAE      0.6587222   0.1589949        5    1      tree
#> 21 F1_weighted      0.2506355   0.7053681        5    1      tree
#> 22 F1_weighted      0.2506355   0.5209557        5    1      tree
#> 23 F1_weighted      0.2506355   0.9396379        5    1      tree
#> 24 F1_weighted      0.2506355   0.9524872        5    1      tree
#> 25        <NA>             NA          NA       NA   NA      <NA>
#>        model_type
#> 1            <NA>
#> 2      regression
#> 3      regression
#> 4      regression
#> 5      regression
#> 6      regression
#> 7            <NA>
#> 8      regression
#> 9      regression
#> 10     regression
#> 11     regression
#> 12     regression
#> 13           <NA>
#> 14     regression
#> 15     regression
#> 16     regression
#> 17     regression
#> 18     regression
#> 19           <NA>
#> 20     regression
#> 21 classification
#> 22 classification
#> 23 classification
#> 24 classification
#> 25           <NA>

# summary pps class 
mat <- summary(pps_generic)
#> * PPS type : generic 
#> * Matrix of Predictive Power Score
#>   - Columns : target
#>   - Rows    : predictors
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> Sepal.Length   1.00000000   0.0570028    0.5284852   0.4336004 0.4058673
#> Sepal.Width    0.09404167   1.0000000    0.2508341   0.2466607 0.2169977
#> Petal.Length   0.61917579   0.1796195    1.0000000   0.7815189 0.7935071
#> Petal.Width    0.48754118   0.1467402    0.7412325   1.0000000 0.7531893
#> Species        0.60444796   0.3601791    0.9159525   0.9355494 1.0000000
mat
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> Sepal.Length   1.00000000   0.0570028    0.5284852   0.4336004 0.4058673
#> Sepal.Width    0.09404167   1.0000000    0.2508341   0.2466607 0.2169977
#> Petal.Length   0.61917579   0.1796195    1.0000000   0.7815189 0.7935071
#> Petal.Width    0.48754118   0.1467402    0.7412325   1.0000000 0.7531893
#> Species        0.60444796   0.3601791    0.9159525   0.9355494 1.0000000

# visualize pps class 
plot(pps_generic)



# pps type is target_by =====================================
##-----------------------------------------------------------
# If the target variable is a categorical variable
categ <- target_by(iris, Species)

# compute all variables
pps_cat <- pps(categ)
pps_cat
#>              x       y                       result_type       pps      metric
#> 1 Sepal.Length Species            predictive power score 0.6044480 F1_weighted
#> 2  Sepal.Width Species            predictive power score 0.3601791 F1_weighted
#> 3 Petal.Length Species            predictive power score 0.9159525 F1_weighted
#> 4  Petal.Width Species            predictive power score 0.9355494 F1_weighted
#> 5      Species Species predictor and target are the same 1.0000000        <NA>
#>   baseline_score model_score cv_folds seed algorithm     model_type
#> 1      0.2506355   0.7053681        5    1      tree classification
#> 2      0.2506355   0.5209557        5    1      tree classification
#> 3      0.2506355   0.9396379        5    1      tree classification
#> 4      0.2506355   0.9524872        5    1      tree classification
#> 5             NA          NA       NA   NA      <NA>           <NA>

# compute only the Petal.Length and Petal.Width variables
pps_cat <- pps(categ, Petal.Length, Petal.Width)
pps_cat
#>              x       y                       result_type       pps      metric
#> 1 Petal.Length Species            predictive power score 0.9159525 F1_weighted
#> 2  Petal.Width Species            predictive power score 0.9355494 F1_weighted
#> 3      Species Species predictor and target are the same 1.0000000        <NA>
#>   baseline_score model_score cv_folds seed algorithm     model_type
#> 1      0.2506355   0.9396379        5    1      tree classification
#> 2      0.2506355   0.9524872        5    1      tree classification
#> 3             NA          NA       NA   NA      <NA>           <NA>

# Using dplyr
pps_cat <- iris %>% 
  target_by(Species) %>% 
  pps()

pps_cat
#>              x       y                       result_type       pps      metric
#> 1 Sepal.Length Species            predictive power score 0.6044480 F1_weighted
#> 2  Sepal.Width Species            predictive power score 0.3601791 F1_weighted
#> 3 Petal.Length Species            predictive power score 0.9159525 F1_weighted
#> 4  Petal.Width Species            predictive power score 0.9355494 F1_weighted
#> 5      Species Species predictor and target are the same 1.0000000        <NA>
#>   baseline_score model_score cv_folds seed algorithm     model_type
#> 1      0.2506355   0.7053681        5    1      tree classification
#> 2      0.2506355   0.5209557        5    1      tree classification
#> 3      0.2506355   0.9396379        5    1      tree classification
#> 4      0.2506355   0.9524872        5    1      tree classification
#> 5             NA          NA       NA   NA      <NA>           <NA>

# Using parallel processing
# pps_cat <- iris %>% 
#   target_by(Species) %>% 
#   pps(do_parallel = TRUE)
# 
# pps_cat

# summary pps class 
tab <- summary(pps_cat)
#> * PPS type : target_by 
#> * Target variable : Species 
#> * Model type : classification 
#> * Information of Predictive Power Score
#>     predictors  target       pps
#> 1      Species Species 1.0000000
#> 2  Petal.Width Species 0.9355494
#> 3 Petal.Length Species 0.9159525
#> 4 Sepal.Length Species 0.6044480
#> 5  Sepal.Width Species 0.3601791
tab
#>     predictors  target       pps
#> 1      Species Species 1.0000000
#> 2  Petal.Width Species 0.9355494
#> 3 Petal.Length Species 0.9159525
#> 4 Sepal.Length Species 0.6044480
#> 5  Sepal.Width Species 0.3601791

# visualize pps class
plot(pps_cat)


##-----------------------------------------------------------
# If the target variable is a numerical variable
num <- target_by(iris, Petal.Length)

pps_num <- pps(num)
pps_num
#>              x            y                       result_type       pps metric
#> 1 Sepal.Length Petal.Length            predictive power score 0.6191758    MAE
#> 2  Sepal.Width Petal.Length            predictive power score 0.1796195    MAE
#> 3 Petal.Length Petal.Length predictor and target are the same 1.0000000   <NA>
#> 4  Petal.Width Petal.Length            predictive power score 0.7815189    MAE
#> 5      Species Petal.Length            predictive power score 0.7935071    MAE
#>   baseline_score model_score cv_folds seed algorithm model_type
#> 1       1.565322   0.5901695        5    1      tree regression
#> 2       1.565322   1.2810047        5    1      tree regression
#> 3             NA          NA       NA   NA      <NA>       <NA>
#> 4       1.565322   0.3352507        5    1      tree regression
#> 5       1.565322   0.3202118        5    1      tree regression

# summary pps class 
tab <- summary(pps_num)
#> * PPS type : target_by 
#> * Target variable : Petal.Length 
#> * Model type : regression 
#> * Information of Predictive Power Score
#>     predictors       target       pps
#> 1 Petal.Length Petal.Length 1.0000000
#> 2      Species Petal.Length 0.7935071
#> 3  Petal.Width Petal.Length 0.7815189
#> 4 Sepal.Length Petal.Length 0.6191758
#> 5  Sepal.Width Petal.Length 0.1796195
tab
#>     predictors       target       pps
#> 1 Petal.Length Petal.Length 1.0000000
#> 2      Species Petal.Length 0.7935071
#> 3  Petal.Width Petal.Length 0.7815189
#> 4 Sepal.Length Petal.Length 0.6191758
#> 5  Sepal.Width Petal.Length 0.1796195

# plot pps class
plot(pps_num)

# }