Imputes missing values with a representative value (mean, median, or mode) or with a statistical method (knn, rpart, or mice).

imputate_na(.data, xvar, yvar, method, seed, print_flag, no_attrs)

Arguments

.data

a data.frame or a tbl_df.

xvar

name of the variable whose missing values are to be replaced.

yvar

target variable.

method

method of missing values imputation.

seed

integer. The random seed used in mice. Used only when method is "mice".

print_flag

logical. If TRUE, mice prints its running log to the console. Use print_flag = FALSE for silent computation. Used only when method is "mice".

no_attrs

logical. If TRUE, return a numerical vector or a factor without attributes. If FALSE, return an object of imputation class.

Value

An object of imputation class, or a numerical or categorical variable. If no_attrs is FALSE, an object of imputation class is returned; if no_attrs is TRUE, a numerical vector or a factor is returned. The attributes of the imputation class are as follows.

  • var_type : the data type of the predictor whose missing values are replaced.

  • method : method of missing value imputation.

    • predictor is a numerical variable.

      • "mean" : arithmetic mean.

      • "median" : median.

      • "mode" : mode.

      • "knn" : K-nearest neighbors.

      • "rpart" : Recursive Partitioning and Regression Trees.

      • "mice" : Multivariate Imputation by Chained Equations.

    • predictor is a categorical variable.

      • "mode" : mode.

      • "rpart" : Recursive Partitioning and Regression Trees.

      • "mice" : Multivariate Imputation by Chained Equations.

  • na_pos : positions of the missing values in the predictor.

  • seed : the random seed used in mice. Used only when method is "mice".

  • type : type of imputation, "missing values".

  • message : a message that tells you whether the imputation was successful.

  • success : Whether the imputation was successful.
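The attributes listed above are ordinary R attributes, so they can be read with attr(). The following base-R sketch does not call dlookr: `impute_median_sketch()` is a hypothetical helper, not the package's implementation, that mimics median imputation and attaches a subset of the documented attributes only to illustrate the shape of the returned object. It deliberately does not set the imputation class.

```r
# Hypothetical helper, NOT dlookr's implementation: median imputation in base R
# that attaches attributes named like those documented for the imputation class.
impute_median_sketch <- function(x) {
  na_pos <- which(is.na(x))                    # positions of missing values
  imputed <- x
  imputed[na_pos] <- median(x, na.rm = TRUE)   # fill with the observed median
  structure(
    imputed,
    var_type = "numerical",
    method   = "median",
    na_pos   = na_pos,
    type     = "missing values",
    success  = !anyNA(imputed)
  )
}

x <- c(10, NA, 30, 50, NA)
res <- impute_median_sketch(x)

attr(res, "na_pos")   # positions that were missing in x
attr(res, "method")   # method label stored with the result

# Dropping the attributes leaves a plain numeric vector, which is what
# no_attrs = TRUE is documented to return.
plain <- as.vector(res)
```

With a real imputation object, the same attr() calls should apply to the attribute names documented above.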

Details

imputate_na() creates an object of imputation class. The `imputation` class stores the positions of the missing values, the imputed values, the imputation method, and so on. Its summary() and plot() methods compare the imputed values with the original values to help you decide whether to use the imputed values in the analysis.
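The before-and-after comparison can be sketched in base R. The helper `compare_imputation_sketch()` below is hypothetical and is not part of dlookr; it computes only a few of the statistics that summary() reports, and it uses mean imputation purely for illustration.

```r
# Hypothetical helper, NOT part of dlookr: compare a variable before and
# after imputation using a few descriptive statistics.
compare_imputation_sketch <- function(original, imputed) {
  stats <- function(v) {
    c(n    = sum(!is.na(v)),
      na   = sum(is.na(v)),
      mean = mean(v, na.rm = TRUE),
      sd   = sd(v, na.rm = TRUE))
  }
  cbind(Original = stats(original), Imputation = stats(imputed))
}

original <- c(10, NA, 30, 50, NA)

# Mean imputation, chosen here only to keep the example small
imputed <- original
imputed[is.na(imputed)] <- mean(original, na.rm = TRUE)

cmp <- compare_imputation_sketch(original, imputed)
cmp
```

In this toy case the mean is unchanged while the standard deviation shrinks, which is the kind of distortion the comparison is meant to reveal before the imputed values are used.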

See vignette("transformation") for an introduction to these concepts.

See also

Examples

# \donttest{
# Generate data for the example
heartfailure2 <- heartfailure
heartfailure2[sample(seq(NROW(heartfailure2)), 20), "platelets"] <- NA
heartfailure2[sample(seq(NROW(heartfailure2)), 5), "smoking"] <- NA

# Replace the missing value of the platelets variable with median
imputate_na(heartfailure2, platelets, method = "median")
#> [1] 265000 263358 162000 259000 327000 204000 127000 454000 263358 388000
#> [11] 368000 253000 136000 276000 427000 47000 262000 166000 237000 87000
#> [21] 276000 297000 289000 368000 263358 149000 196000 284000 153000 200000
#> [31] 263358 360000 319000 302000 188000 228000 226000 321000 305000 329000
#> [41] 263358 153000 185000 218000 194000 310000 271000 451000 140000 395000
#> [51] 166000 418000 263358 351000 255000 461000 223000 216000 319000 254000
#> [61] 390000 216000 254000 385000 259000 119000 213000 274000 244000 497000
#> [71] 374000 122000 243000 149000 266000 204000 317000 237000 283000 324000
#> [81] 293000 263358 196000 172000 302000 406000 173000 304000 235000 181000
#> [91] 249000 297000 263358 210000 327000 219000 254000 255000 318000 221000
#> [101] 298000 259000 149000 226000 286000 621000 263000 226000 304000 850000
#> [111] 306000 228000 252000 351000 328000 164000 271000 507000 203000 259000
#> [121] 210000 162000 228000 259000 217000 237000 271000 300000 267000 227000
#> [131] 249000 250000 263358 295000 231000 263358 172000 305000 221000 211000
#> [141] 263358 348000 259000 229000 338000 266000 218000 242000 259000 228000
#> [151] 235000 244000 184000 263358 235000 194000 277000 262000 235000 362000
#> [161] 242000 174000 448000 75000 259000 192000 220000 70000 270000 305000
#> [171] 263358 325000 176000 189000 281000 337000 105000 259000 267000 279000
#> [181] 303000 221000 265000 224000 219000 389000 153000 365000 201000 275000
#> [191] 350000 309000 259000 160000 126000 223000 259000 259000 279000 263358
#> [201] 73000 377000 220000 212000 259000 362000 226000 186000 283000 268000
#> [211] 389000 147000 481000 244000 290000 203000 358000 151000 259000 371000
#> [221] 263358 194000 365000 130000 504000 265000 189000 141000 237000 274000
#> [231] 62000 185000 255000 330000 305000 406000 248000 173000 257000 263358
#> [241] 259000 249000 259000 220000 264000 282000 314000 246000 301000 223000
#> [251] 404000 259000 274000 236000 259000 334000 294000 253000 233000 308000
#> [261] 203000 283000 198000 208000 147000 362000 263358 133000 302000 222000
#> [271] 263358 221000 215000 189000 150000 422000 327000 25100 232000 451000
#> [281] 241000 51000 215000 259000 279000 336000 279000 543000 263358 390000
#> [291] 222000 133000 259000 179000 155000 270000 742000 259000 395000
#> attr(,"var_type")
#> [1] "numerical"
#> attr(,"method")
#> [1] "median"
#> attr(,"na_pos")
#> [1] 4 65 102 120 124 143 149 165 178 193 197 205 219 241 243 252 255 284 293
#> [20] 298
#> attr(,"type")
#> [1] "missing values"
#> attr(,"message")
#> [1] "complete imputation"
#> attr(,"success")
#> [1] TRUE
#> attr(,"class")
#> [1] "imputation" "numeric"
# Replace the missing value of the platelets variable with rpart
# The target variable is death_event.
# imputate_na(heartfailure2, platelets, death_event, method = "rpart")

# Replace the missing value of the smoking variable with mode
# imputate_na(heartfailure2, smoking, method = "mode")

# Replace the missing value of the smoking variable with mice
# The target variable is death_event.
# imputate_na(heartfailure2, smoking, death_event, method = "mice")

## using dplyr -------------------------------------
library(dplyr)

# The mean before and after the imputation of the platelets variable
heartfailure2 %>%
  mutate(platelets_imp = imputate_na(heartfailure2, platelets, death_event,
                                     method = "knn", no_attrs = TRUE)) %>%
  group_by(death_event) %>%
  summarise(orig = mean(platelets, na.rm = TRUE),
            imputation = mean(platelets_imp))
#> # A tibble: 2 x 3
#>   death_event    orig imputation
#>   <fct>         <dbl>      <dbl>
#> 1 No          266726.    266666.
#> 2 Yes         256307.    256888.
# If the variable of interest is a numerical variable
platelets <- imputate_na(heartfailure2, platelets, death_event, method = "rpart")
platelets
#> [1] 265000.0 263358.0 162000.0 310000.0 327000.0 204000.0 127000.0 454000.0
#> [9] 263358.0 388000.0 368000.0 253000.0 136000.0 276000.0 427000.0 47000.0
#> [17] 262000.0 166000.0 237000.0 87000.0 276000.0 297000.0 289000.0 368000.0
#> [25] 263358.0 149000.0 196000.0 284000.0 153000.0 200000.0 263358.0 360000.0
#> [33] 319000.0 302000.0 188000.0 228000.0 226000.0 321000.0 305000.0 329000.0
#> [41] 263358.0 153000.0 185000.0 218000.0 194000.0 310000.0 271000.0 451000.0
#> [49] 140000.0 395000.0 166000.0 418000.0 263358.0 351000.0 255000.0 461000.0
#> [57] 223000.0 216000.0 319000.0 254000.0 390000.0 216000.0 254000.0 385000.0
#> [65] 404529.8 119000.0 213000.0 274000.0 244000.0 497000.0 374000.0 122000.0
#> [73] 243000.0 149000.0 266000.0 204000.0 317000.0 237000.0 283000.0 324000.0
#> [81] 293000.0 263358.0 196000.0 172000.0 302000.0 406000.0 173000.0 304000.0
#> [89] 235000.0 181000.0 249000.0 297000.0 263358.0 210000.0 327000.0 219000.0
#> [97] 254000.0 255000.0 318000.0 221000.0 298000.0 224951.0 149000.0 226000.0
#> [105] 286000.0 621000.0 263000.0 226000.0 304000.0 850000.0 306000.0 228000.0
#> [113] 252000.0 351000.0 328000.0 164000.0 271000.0 507000.0 203000.0 250667.2
#> [121] 210000.0 162000.0 228000.0 321250.0 217000.0 237000.0 271000.0 300000.0
#> [129] 267000.0 227000.0 249000.0 250000.0 263358.0 295000.0 231000.0 263358.0
#> [137] 172000.0 305000.0 221000.0 211000.0 263358.0 348000.0 250667.2 229000.0
#> [145] 338000.0 266000.0 218000.0 242000.0 224951.0 228000.0 235000.0 244000.0
#> [153] 184000.0 263358.0 235000.0 194000.0 277000.0 262000.0 235000.0 362000.0
#> [161] 242000.0 174000.0 448000.0 75000.0 231123.5 192000.0 220000.0 70000.0
#> [169] 270000.0 305000.0 263358.0 325000.0 176000.0 189000.0 281000.0 337000.0
#> [177] 105000.0 225333.3 267000.0 279000.0 303000.0 221000.0 265000.0 224000.0
#> [185] 219000.0 389000.0 153000.0 365000.0 201000.0 275000.0 350000.0 309000.0
#> [193] 277223.9 160000.0 126000.0 223000.0 404529.8 259000.0 279000.0 263358.0
#> [201] 73000.0 377000.0 220000.0 212000.0 184666.7 362000.0 226000.0 186000.0
#> [209] 283000.0 268000.0 389000.0 147000.0 481000.0 244000.0 290000.0 203000.0
#> [217] 358000.0 151000.0 277223.9 371000.0 263358.0 194000.0 365000.0 130000.0
#> [225] 504000.0 265000.0 189000.0 141000.0 237000.0 274000.0 62000.0 185000.0
#> [233] 255000.0 330000.0 305000.0 406000.0 248000.0 173000.0 257000.0 263358.0
#> [241] 381589.5 249000.0 264946.5 220000.0 264000.0 282000.0 314000.0 246000.0
#> [249] 301000.0 223000.0 404000.0 309484.2 274000.0 236000.0 224951.0 334000.0
#> [257] 294000.0 253000.0 233000.0 308000.0 203000.0 283000.0 198000.0 208000.0
#> [265] 147000.0 362000.0 263358.0 133000.0 302000.0 222000.0 263358.0 221000.0
#> [273] 215000.0 189000.0 150000.0 422000.0 327000.0 25100.0 232000.0 451000.0
#> [281] 241000.0 51000.0 215000.0 196341.7 279000.0 336000.0 279000.0 543000.0
#> [289] 263358.0 390000.0 222000.0 133000.0 196341.7 179000.0 155000.0 270000.0
#> [297] 742000.0 404529.8 395000.0
#> attr(,"var_type")
#> [1] "numerical"
#> attr(,"method")
#> [1] "rpart"
#> attr(,"na_pos")
#> [1] 4 65 102 120 124 143 149 165 178 193 197 205 219 241 243 252 255 284 293
#> [20] 298
#> attr(,"type")
#> [1] "missing values"
#> attr(,"message")
#> [1] "complete imputation"
#> attr(,"success")
#> [1] TRUE
#> attr(,"class")
#> [1] "imputation" "numeric"
summary(platelets)
#> * Impute missing values based on Recursive Partitioning and Regression Trees
#>  - method : rpart
#>
#> * Information of Imputation (before vs after)
#>              Original   Imputation
#> n        2.790000e+02 2.990000e+02
#> na       2.000000e+01 0.000000e+00
#> mean     2.632900e+05 2.642917e+05
#> sd       9.850920e+04 9.696201e+04
#> se_mean  5.897592e+03 5.607458e+03
#> IQR      9.250000e+04 9.100000e+04
#> skewness 1.480121e+00 1.449995e+00
#> kurtosis 6.359018e+00 6.286804e+00
#> p00      2.510000e+04 2.510000e+04
#> p01      5.958000e+04 6.178000e+04
#> p05      1.327000e+05 1.330000e+05
#> p10      1.530000e+05 1.590000e+05
#> p20      1.952000e+05 1.963417e+05
#> p25      2.115000e+05 2.140000e+05
#> p30      2.204000e+05 2.210000e+05
#> p40      2.362000e+05 2.362000e+05
#> p50      2.590000e+05 2.590000e+05
#> p60      2.660000e+05 2.668000e+05
#> p70      2.896000e+05 2.918000e+05
#> p75      3.040000e+05 3.050000e+05
#> p80      3.198000e+05 3.211000e+05
#> p90      3.746000e+05 3.822716e+05
#> p95      4.225000e+05 4.184000e+05
#> p99      5.601600e+05 5.445600e+05
#> p100     8.500000e+05 8.500000e+05
# plot(platelets)

# If the variable of interest is a categorical variable
# smoking <- imputate_na(heartfailure2, smoking, death_event, method = "mice")
# smoking
# summary(smoking)
# plot(smoking)
# }