Preface
After you have acquired the data, you should do the following:
- Diagnose data quality.
- If there is a problem with data quality,
- The data must be corrected or re-acquired.
- Explore data to understand the data and find scenarios for
performing the analysis.
- Derive new variables or perform variable transformations.
The dlookr package makes these steps fast and easy:
- Performs a data diagnosis or automatically generates a data
diagnosis report.
- Discover data in various ways and automatically generate
EDA(exploratory data analysis) reports.
- Impute missing values and outliers, resolve skewed data, and
binaries continuous variables into categorical variables. And generates
an automated report to support it.
dlookr increases synergy with dplyr
. Particularly in
data exploration and data wrangling, it increases the efficiency of the
tidyverse
package group.
Supported data structures
Data diagnosis supports the following data structures.
- data frame: data.frame class.
- data table: tbl_df class.
- table of DBMS: table of the DBMS through tbl_dbi
- Use dplyr as the back-end interface for any DBI-compatible
database.
List of supported tasks of data analytics
Diagnose Data
Overall Diagnose Data
describe overview of data |
Inquire basic information to understand the data in
general |
overview() |
|
summary overview object |
summary described overview of data |
summary.overview() |
|
plot overview object |
plot described overview of data |
plot.overview() |
|
diagnose data quality of variables |
The scope of data quality diagnosis is information on
missing values and unique value information |
diagnose() |
x |
diagnose data quality of categorical variables |
frequency, ratio, rank by levels of each variables |
diagnose_category() |
x |
diagnose data quality of numerical variables |
descriptive statistics, number of zero, minus,
outliers |
diagnose_numeric() |
x |
diagnose data quality for outlier |
number of outliers, ratio, mean of outliers, mean with
outliers, mean without outliers |
diagnose_outlier() |
x |
plot outliers information of numerical data |
box plot and histogram whith outliers, without
outliers |
plot_outlier.data.frame() |
x |
plot outliers information of numerical data by target
variable |
box plot and density plot whith outliers, without
outliers |
plot_outlier.target_df() |
x |
diagnose combination of categorical variables |
Check for sparse cases of level combinations of
categorical variables |
diagnose_sparese() |
|
Visualize Missing Values
pareto chart for missing value |
visualize the Pareto chart for variables with a missing
value. |
plot_na_pareto() |
|
combination chart for missing value |
visualize the distribution of missing value by
combining variables. |
plot_na_hclust() |
|
plot the combination variables that is include missing
value |
visualize the combinations of missing value across
cases |
plot_na_intersect() |
|
Reporting
report the information of data diagnosis into a PDF
file |
report the information for diagnosing the data
quality |
diagnose_report() |
x |
reporting the information of data diagnosis into HTML
file |
report the information for diagnosing the quality of
the data |
diagnose_report() |
x |
reporting the information of data diagnosis into HTML
file |
dynamic report the information for diagnosing the
quality of the data |
diagnose_web_report() |
x |
reporting the information of data diagnosis into PDF
and HTML files |
paged report the information for diagnosing the quality
of the data |
diagnose_paged_report() |
x |
EDA
Normality Test
numerical |
summaries |
Shapiro-Wilk normality test |
normality() |
x |
numerical |
summaries |
normality diagnosis plot (histogram, Q-Q plots) |
plot_normality() |
x |
Relationship between target variable and predictors
Reporting
reporting the information of EDA into PDF file |
reporting the information of EDA |
eda_report() |
x |
reporting the information of EDA into HTML file |
reporting the information of EDA |
eda_report() |
x |
reporting the information of EDA into PDF file |
dynamic reporting the information of EDA |
eda_web_report() |
x |
reporting the information of EDA into HTML file |
paged reporting the information of EDA |
eda_paged_report() |
x |
Find Variables
missing values |
find the variable that contains the missing value in
the object that inherits the data.frame |
find_na() |
|
outliers |
find the numerical variable that contains outliers in
the object that inherits the data.frame |
find_outliers() |
|
skewed variable |
find the numerical variable that is the skewed variable
that inherits the data.frame |
find_skewness() |
|
Imputation
missing values |
missing values are imputed with some representative
values and statistical methods. |
imputate_na() |
|
outliers |
outliers are imputed with some representative values
and statistical methods. |
imputate_outlier() |
|
summaries |
calculate descriptive statistics of the original and
imputed values. |
summary.imputation() |
|
visualize |
the imputation of a numerical variable is a density
plot, and the imputation of a categorical variable is a bar plot. |
plot.imputation() |
|
Binning
binning |
converts a numeric variable to a categorization
variable |
binning() |
|
summaries |
calculate frequency and relative frequency for each
levels(bins) |
summary.bins() |
|
visualize |
visualize two plots on a single screen. The plot at the
top is a histogram representing the frequency of the level. The plot at
the bottom is a bar chart representing the frequency of the level. |
plot.bins() |
|
optimal binning |
categorizes a numeric characteristic into bins for
ulterior usage in scoring modeling |
binning_by() |
|
summaries |
summary metrics to evaluate the performance of binomial
classification model |
summary.optimal_bins() |
|
visualize |
generates plots for understand distribution, bad rate,
and weight of evidence after running binning_by() |
plot.optimal_bins() |
|
infogain binning |
categorizes a numeric characteristic into bins for
multi-class variables using recursive information gain ratio
maximization |
binning_rgr() |
|
visualize |
generates plots for understanding distribution and
distribution by target variable after running binning_rgr() |
plot.infogain_bins() |
|
evaluate |
calculates metrics to evaluate the performance of
binned variable for binomial classification model |
performance_bin() |
|
summaries |
summary metrics to evaluate the performance of binomial
classification model after performance_bin() |
summary.performance_bin() |
|
visualize |
It generates plots to understand frequency, WoE by bins
using performance_bin after running binning_by() |
plot.performance_bin() |
|
visualize |
extract bins from “bins” and “optimal_bins”
objects |
extract.bins() |
|
Diagnose Binned Variable
diagnosis |
performs diagnose performance that calculates metrics
to evaluate the performance of binned variable for binomial
classification model |
performance_bin() |
|
summaries |
summary method for “performance_bin”. summary metrics
to evaluate the performance of the binomial classification model |
summary.performance_bin() |
|
visualize |
visualize for understanding frequency, WoE by bins
using performance_bin and something else |
plot.performance_bin() |
|
transformation |
performs variable transformation for standardization
and resolving skewness of numerical variables |
transform() |
|
summaries |
compares the distribution of data before and after data
transformation |
summary.transform() |
|
visualize |
visualize two kinds of a plot by attribute of the
‘transform’ class. The transformation of a numerical variable is a
density plot |
plot.transform() |
|
Miscellaneous
Statistics
statistics |
calculate the entropy |
entropy() |
|
statistics |
calculate the skewness of the data |
skewness() |
|
statistics |
calculate the kurtosis of the data |
kurtosis() |
|
statistics |
calculate the Jensen-Shannon divergence between two
probability distributions |
jsd() |
|
statistics |
calculate the Kullback-Leibler divergence between two
probability distributions |
kld() |
|
statistics |
calculate the Cramer’s V statistic between two
categorical(discrete) variables |
cramer() |
|
statistics |
calculate the Theil’s U statistic between two
categorical(discrete) variables |
theil() |
|
statistics |
finding percentile of a numerical variable. |
get_percentile() |
|
statistics |
transform a numeric vector using several methods like
“log”, “sqrt”, “log+1”, “log+a”, “1/x”, “x^2”, “x^3”, “Box-Cox”,
“Yeo-Johnson” |
get_transform() |
|
statistics |
calculate the Cramer’s V statistic |
cramer() |
|
statistics |
calculate the Theil’s U statistic |
theil() |
|
Programming
programming |
extracts variable information having a certain class
from an object inheriting data.frame |
find_class() |
|
programming |
gets class of variables in data.frame or tbl_df |
get_class() |
|
programming |
retrieves the column information of the DBMS table
through the tbl_bdi object of dplyr |
get_column_info() |
|
programming |
finding the user machine’s OS. |
get_os() |
|
programming |
import Google fonts |
import_google_font() |
|