Preface
After you have acquired the data, you should do the following:
- Diagnose data quality.
- If there is a problem with data quality, correct or re-acquire the data.
- Explore the data to understand it and to find scenarios for
performing the analysis.
- Derive new variables or perform variable transformations.
The dlookr package makes these steps fast and easy:
- Performs a data diagnosis or automatically generates a data
diagnosis report.
- Explores data in various ways and automatically generates
EDA (exploratory data analysis) reports.
- Imputes missing values and outliers, resolves skewed data, and
bins continuous variables into categorical variables. It also generates
an automated report to support these steps.
dlookr works well together with dplyr. Particularly in
data exploration and data wrangling, it increases the efficiency of the
tidyverse package group.
Supported data structures
Data diagnosis supports the following data structures.
- data frame: data.frame class.
- tibble: tbl_df class.
- table of DBMS: a table in the DBMS accessed through tbl_dbi, using
dplyr as the back-end interface for any DBI-compatible database.
List of supported tasks of data analytics
Diagnose Data
Overall Diagnose Data
| Task | Description | Function | Supports DBI |
|---|---|---|---|
| describe overview of data | inquire basic information to understand the data in general | overview() | |
| summary overview object | summarize the described overview of data | summary.overview() | |
| plot overview object | plot the described overview of data | plot.overview() | |
| diagnose data quality of variables | information on missing values and unique values | diagnose() | x |
| diagnose data quality of categorical variables | frequency, ratio, and rank by level of each variable | diagnose_category() | x |
| diagnose data quality of numerical variables | descriptive statistics, number of zeros, negative values, and outliers | diagnose_numeric() | x |
| diagnose data quality for outlier | number of outliers, ratio, mean of outliers, mean with outliers, mean without outliers | diagnose_outlier() | x |
| plot outliers information of numerical data | box plot and histogram with outliers and without outliers | plot_outlier.data.frame() | x |
| plot outliers information of numerical data by target variable | box plot and density plot with outliers and without outliers | plot_outlier.target_df() | x |
| diagnose combination of categorical variables | check for sparse cases of level combinations of categorical variables | diagnose_sparese() | |
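
The outlier figures listed for diagnose_outlier() (count, ratio, mean of outliers, mean with and without outliers) can be illustrated with a minimal base R sketch. This is not dlookr's implementation: `outlier_summary()` is a hypothetical helper, and it assumes outliers are defined by the 1.5 * IQR boxplot rule of `boxplot.stats()`.

```r
# Hypothetical helper, not part of dlookr.
# Assumption: an outlier is a value beyond the 1.5 * IQR boxplot whiskers.
outlier_summary <- function(x) {
  x <- x[!is.na(x)]
  out <- boxplot.stats(x)$out        # values flagged by the boxplot rule
  is_out <- x %in% out
  data.frame(
    outliers_cnt   = sum(is_out),                       # number of outliers
    outliers_ratio = 100 * mean(is_out),                # percentage of outliers
    outliers_mean  = if (any(is_out)) mean(x[is_out]) else NA_real_,
    with_mean      = mean(x),                           # mean including outliers
    without_mean   = mean(x[!is_out])                   # mean excluding outliers
  )
}

x <- c(10, 12, 11, 13, 12, 11, 10, 95)
res <- outlier_summary(x)
print(res)
```

Comparing `with_mean` and `without_mean` shows how strongly a few extreme values pull the mean, which is the main reason to inspect outliers before imputing or transforming a variable.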
Visualize Missing Values
| Task | Description | Function | Supports DBI |
|---|---|---|---|
| pareto chart for missing value | visualize a Pareto chart of the variables that have missing values | plot_na_pareto() | |
| combination chart for missing value | visualize the distribution of missing values by combining variables | plot_na_hclust() | |
| plot the combination of variables that include missing values | visualize the combinations of missing values across cases | plot_na_intersect() | |
Reporting
| Task | Description | Function | Supports DBI |
|---|---|---|---|
| report the information of data diagnosis into a PDF file | report the information for diagnosing the quality of the data | diagnose_report() | x |
| report the information of data diagnosis into an HTML file | report the information for diagnosing the quality of the data | diagnose_report() | x |
| report the information of data diagnosis into an HTML file | dynamic report of the information for diagnosing the quality of the data | diagnose_web_report() | x |
| report the information of data diagnosis into PDF and HTML files | paged report of the information for diagnosing the quality of the data | diagnose_paged_report() | x |
EDA
Normality Test
| Type | Task | Description | Function | Supports DBI |
|---|---|---|---|---|
| numerical | summaries | Shapiro-Wilk normality test | normality() | x |
| numerical | summaries | normality diagnosis plot (histogram, Q-Q plots) | plot_normality() | x |
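
The Shapiro-Wilk test named above is also available in base R as `shapiro.test()`. The sketch below applies it to every numeric column of a data frame. `shapiro_by_column()` is a hypothetical helper written for illustration, and its output layout is an assumption that does not necessarily match what normality() returns.

```r
# Hypothetical helper, not part of dlookr:
# runs the Shapiro-Wilk normality test on each numeric column.
shapiro_by_column <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]   # keep numeric columns only
  do.call(rbind, lapply(names(num), function(v) {
    test <- shapiro.test(num[[v]])                # valid for 3 to 5000 observations
    data.frame(vars = v,
               statistic = unname(test$statistic),
               p_value = test$p.value,
               stringsAsFactors = FALSE)
  }))
}

res <- shapiro_by_column(iris)
print(res)
```

A small p-value is evidence against normality for that variable, which is usually the cue to look at a histogram or Q-Q plot next.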
Relationship between target variable and predictors
Reporting
| Task | Description | Function | Supports DBI |
|---|---|---|---|
| report the information of EDA into a PDF file | report the information of EDA | eda_report() | x |
| report the information of EDA into an HTML file | report the information of EDA | eda_report() | x |
| report the information of EDA into an HTML file | dynamic report of the information of EDA | eda_web_report() | x |
| report the information of EDA into PDF and HTML files | paged report of the information of EDA | eda_paged_report() | x |
Find Variables
| Type | Description | Function | Supports DBI |
|---|---|---|---|
| missing values | find the variables that contain missing values in the object that inherits the data.frame | find_na() | |
| outliers | find the numerical variables that contain outliers in the object that inherits the data.frame | find_outliers() | |
| skewed variable | find the skewed numerical variables in the object that inherits the data.frame | find_skewness() | |
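
As a rough illustration of what "finding variables with missing values" means, the following base R sketch lists the columns of the built-in `airquality` data set that contain at least one NA, together with their missing rate. It does not call find_na(); the variable names `na_rate` and `na_cols` are chosen only for this example.

```r
# Base R sketch, independent of dlookr.
df <- airquality

# Share of missing values per column.
na_rate <- colMeans(is.na(df))

# Names of the columns that contain at least one missing value.
na_cols <- names(na_rate)[na_rate > 0]

print(na_cols)
print(round(100 * na_rate[na_cols], 1))   # missing rate in percent
```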
Imputation
| Type | Description | Function | Supports DBI |
|---|---|---|---|
| missing values | missing values are imputed with representative values or statistical methods | imputate_na() | |
| outliers | outliers are imputed with representative values or statistical methods | imputate_outlier() | |
| summaries | calculate descriptive statistics of the original and imputed values | summary.imputation() | |
| visualize | draw a density plot for an imputed numerical variable and a bar plot for an imputed categorical variable | plot.imputation() | |
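
Imputation with a "representative value" can be sketched in base R as follows. `impute_simple()` is a hypothetical helper, not a dlookr function, and it covers only mean and median replacement. Which methods imputate_na() actually offers should be checked in the package documentation.

```r
# Hypothetical helper, not part of dlookr:
# replaces NA with the mean or median of the observed values.
impute_simple <- function(x, method = c("mean", "median")) {
  method <- match.arg(method)
  fill <- switch(method,
                 mean   = mean(x, na.rm = TRUE),
                 median = median(x, na.rm = TRUE))
  x[is.na(x)] <- fill
  x
}

income <- c(30, 45, NA, 52, 38, NA, 41)
imputed <- impute_simple(income, "median")

# Compare descriptive statistics before and after imputation.
print(summary(income))
print(summary(imputed))
```

Comparing the two summaries is a quick check that the imputation did not distort the distribution more than intended.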
Binning
| Type | Description | Function | Supports DBI |
|---|---|---|---|
| binning | converts a numeric variable to a categorical variable | binning() | |
| summaries | calculate frequency and relative frequency for each level (bin) | summary.bins() | |
| visualize | visualize two plots on a single screen. The plot at the top is a histogram representing the frequency of the level. The plot at the bottom is a bar chart representing the frequency of the level. | plot.bins() | |
| optimal binning | categorizes a numeric characteristic into bins for later use in scoring modeling | binning_by() | |
| summaries | summary metrics to evaluate the performance of a binomial classification model | summary.optimal_bins() | |
| visualize | generates plots for understanding distribution, bad rate, and weight of evidence after running binning_by() | plot.optimal_bins() | |
| infogain binning | categorizes a numeric characteristic into bins for multi-class variables using recursive information gain ratio maximization | binning_rgr() | |
| visualize | generates plots for understanding distribution and distribution by target variable after running binning_rgr() | plot.infogain_bins() | |
| evaluate | calculates metrics to evaluate the performance of a binned variable for a binomial classification model | performance_bin() | |
| summaries | summary metrics to evaluate the performance of a binomial classification model after performance_bin() | summary.performance_bin() | |
| visualize | generates plots for understanding frequency and WoE by bins using performance_bin after running binning_by() | plot.performance_bin() | |
| extract | extract bins from “bins” and “optimal_bins” objects | extract.bins() | |
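
To make the idea of binning concrete, here is a minimal base R sketch that cuts a numeric vector into quantile-based bins and tabulates the frequency and relative frequency of each level. It uses `cut()` and `quantile()` rather than binning(), and quantile breaks are only one possible binning rule, chosen here as an assumption.

```r
# Base R sketch, independent of dlookr.
x <- 1:20

# Assumption: four bins with breaks at the quartiles.
breaks <- quantile(x, probs = seq(0, 1, length.out = 5))
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# Frequency and relative frequency for each level (bin).
freq <- table(bins)
bin_summary <- data.frame(
  levels = names(freq),
  freq   = as.vector(freq),
  rate   = as.vector(freq) / length(x),
  stringsAsFactors = FALSE
)
print(bin_summary)
```

Quantile breaks give bins with roughly equal counts, which is convenient when the variable is skewed and equal-width bins would leave some levels almost empty.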
Diagnose Binned Variable
| Type | Description | Function | Supports DBI |
|---|---|---|---|
| diagnosis | calculates metrics to evaluate the performance of a binned variable for a binomial classification model | performance_bin() | |
| summaries | summary method for “performance_bin”: summary metrics to evaluate the performance of the binomial classification model | summary.performance_bin() | |
| visualize | generates plots for understanding frequency and WoE by bins using performance_bin | plot.performance_bin() | |
| transformation | performs variable transformation for standardization and resolving skewness of numerical variables | transform() | |
| summaries | compares the distribution of data before and after data transformation | summary.transform() | |
| visualize | visualizes two kinds of plot by attribute of the ‘transform’ class. The transformation of a numerical variable is shown as a density plot | plot.transform() | |
Miscellaneous
Statistics
| Type | Description | Function | Supports DBI |
|---|---|---|---|
| statistics | calculate the entropy | entropy() | |
| statistics | calculate the skewness of the data | skewness() | |
| statistics | calculate the kurtosis of the data | kurtosis() | |
| statistics | calculate the Jensen-Shannon divergence between two probability distributions | jsd() | |
| statistics | calculate the Kullback-Leibler divergence between two probability distributions | kld() | |
| statistics | calculate the Cramer’s V statistic between two categorical (discrete) variables | cramer() | |
| statistics | calculate the Theil’s U statistic between two categorical (discrete) variables | theil() | |
| statistics | find the percentile of a numerical variable | get_percentile() | |
| statistics | transform a numeric vector using methods such as “log”, “sqrt”, “log+1”, “log+a”, “1/x”, “x^2”, “x^3”, “Box-Cox”, “Yeo-Johnson” | get_transform() | |
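
The entropy and divergence measures listed above follow standard textbook definitions, sketched here in base R. The helpers `shannon_entropy()`, `kl_divergence()`, and `js_divergence()` are hypothetical names for this example. The natural logarithm and the normalization of inputs to sum to 1 are assumptions, so results may differ from entropy(), kld(), and jsd() if those use another base or convention.

```r
# Hypothetical helpers, not part of dlookr. Natural log is assumed.

# Shannon entropy: H(p) = -sum(p * log(p))
shannon_entropy <- function(p) {
  p <- p / sum(p)          # normalize to a probability distribution
  p <- p[p > 0]            # 0 * log(0) is treated as 0
  -sum(p * log(p))
}

# Kullback-Leibler divergence: KL(p || q) = sum(p * log(p / q))
# Returns Inf if q is 0 somewhere p is positive.
kl_divergence <- function(p, q) {
  p <- p / sum(p)
  q <- q / sum(q)
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Jensen-Shannon divergence: average KL divergence to the mixture m.
js_divergence <- function(p, q) {
  p <- p / sum(p)
  q <- q / sum(q)
  m <- (p + q) / 2
  0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
}

p <- c(0.5, 0.5)
q <- c(0.9, 0.1)
print(shannon_entropy(p))
print(kl_divergence(p, q))
print(js_divergence(p, q))
```

Unlike the Kullback-Leibler divergence, the Jensen-Shannon divergence is symmetric in its arguments and stays finite even when one distribution has zero probability where the other does not.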
Programming
| Type | Description | Function | Supports DBI |
|---|---|---|---|
| programming | extracts variable information having a certain class from an object inheriting data.frame | find_class() | |
| programming | gets the class of variables in a data.frame or tbl_df | get_class() | |
| programming | retrieves the column information of the DBMS table through the tbl_dbi object of dplyr | get_column_info() | |
| programming | finds the OS of the user's machine | get_os() | |
| programming | imports Google fonts | import_google_font() | |