Automated exploratory data analysis for R. Point it at a data frame and it returns a structured profile — column types, missingness, distributional statistics, normality tests, outliers, correlations, a data-quality score, and ggplot2 figures — through a single function, profile_data().
The aim is to cover the first hour of EDA that you’d otherwise write by hand for every new dataset, while keeping the result a plain, inspectable object you can build on.
Installation
# install.packages("remotes")
remotes::install_github("mqfarooqi1/dataProfilerR")
Depends on ggplot2. The Anderson–Darling normality test additionally uses the suggested nortest package; if it isn’t installed, only Shapiro–Wilk is run.
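The nortest fallback can be pictured with requireNamespace(), which returns FALSE rather than raising an error when a package is missing. The following is only a sketch of that pattern, not the package's source: normality_sketch() and its max_n argument are hypothetical names, and the cap of 5000 is simply the upper limit that stats::shapiro.test() accepts.

```r
# Hypothetical helper (not part of dataProfilerR): always run Shapiro-Wilk,
# and add Anderson-Darling only when the suggested package is installed.
normality_sketch <- function(x, max_n = 5000) {
  x <- x[is.finite(x)]
  # shapiro.test() accepts between 3 and 5000 values, so subsample if needed
  if (length(x) > max_n) x <- sample(x, max_n)
  out <- list(shapiro_p = stats::shapiro.test(x)$p.value, ad_p = NA_real_)
  # requireNamespace() returns FALSE instead of erroring when nortest is absent;
  # nortest::ad.test() itself needs more than 7 observations
  if (requireNamespace("nortest", quietly = TRUE) && length(x) > 7) {
    out$ad_p <- nortest::ad.test(x)$p.value
  }
  out
}

res <- normality_sketch(iris$Sepal.Length)
str(res)
```

Without nortest, ad_p stays NA, so downstream code can rely on the same two elements either way.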
Quick start
library(dataProfilerR)
p <- profile_data(iris)
p # concise overview + quality score
summary(p) # numeric summary, missingness, normality, outliers, correlations
plot(p, which = "correlation") # retrieve a figure
plot(p, which = "distribution", column = "Sepal.Length")
# components are just list elements
p$metadata$column_types
p$diagnostics$quality$score
p$statistics$numeric
# grouped comparison + a self-contained HTML report (needs pandoc)
p <- profile_data(iris, group_by = "Species")
p$diagnostics$groups$numeric_by_group
report(p, "iris_report.html")
See the vignette (vignette("dataProfilerR")) for a full walkthrough on a messy dataset.
Architecture and design decisions
The package is organised as a pipeline of independent, individually-callable functions, with one orchestrator on top:
                    profile_data()   <- orchestrator
    ┌──────────────────────┼────────────────────────────────────┐
profiling             statistics                          visualization
─────────             ──────────                          ─────────────
infer_column_types    normality_tests                     plot_missing
analyze_missing       detect_outliers / outlier_summary   plot_distribution
summarize_columns     correlation_analysis                plot_correlation
data_quality_score                                        plot_boxplots
                                                          plot_pairs
                           │
                           ▼
               data_profile (S3 object) ── print() / summary() / plot()
Design choices worth calling out:
- S3, not S4. A profiling result is data, not behaviour. Modelling it as a plain list with a class keeps it transparent (str(p) just works), serialisable, and easy to extend with new elements without redefining a formal class. S4’s validity and dispatch machinery would be overhead with no payoff here. The methods provided are print, summary, and plot.
- Each stage stands alone. infer_column_types(), detect_outliers(), plot_correlation() etc. all work directly on a data frame or vector, so the package is useful piecemeal, not only through the orchestrator.
- Type inference drives the rest. Columns are classified once (numeric/integer/date/logical/categorical/text) and that classification routes which statistics and plots apply.
- Fail early on bad input. A shared validator rejects non-data-frames, empty frames, and duplicate/blank column names with clear messages rather than letting them surface as cryptic downstream errors.
- Minimal dependencies. Only ggplot2 beyond base/recommended packages. Skewness and kurtosis are implemented directly rather than pulling in moments; Anderson–Darling degrades gracefully when nortest is absent.
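To make the S3 and fail-early points concrete, here is a minimal sketch of the general pattern: a validator that stops with a clear message, a constructor that returns a classed list, and a print method. All names here (validate_input, new_profile_sketch, the profile_sketch class) are hypothetical and chosen for illustration; the package's internal functions and messages may differ.

```r
# Hypothetical validator: reject bad input up front with a clear message.
validate_input <- function(df) {
  if (!is.data.frame(df)) stop("`df` must be a data frame.", call. = FALSE)
  if (nrow(df) == 0 || ncol(df) == 0) stop("`df` must not be empty.", call. = FALSE)
  nms <- names(df)
  if (anyNA(nms) || any(!nzchar(nms)) || anyDuplicated(nms) > 0) {
    stop("Column names must be non-blank and unique.", call. = FALSE)
  }
  invisible(df)
}

# Hypothetical constructor: the result is a plain list with a class attribute.
new_profile_sketch <- function(df) {
  validate_input(df)
  types <- vapply(df, function(col) class(col)[1], character(1))
  structure(
    list(metadata = list(n_rows = nrow(df), n_cols = ncol(df), column_types = types)),
    class = "profile_sketch"
  )
}

# An S3 method is an ordinary function named <generic>.<class>.
print.profile_sketch <- function(x, ...) {
  cat("<profile_sketch>", x$metadata$n_rows, "rows x", x$metadata$n_cols, "cols\n")
  invisible(x)
}

p <- new_profile_sketch(iris)
print(p)
str(unclass(p), max.level = 2)  # underneath it is still just a list
```

Adding a new element later only means adding another entry to the list, which is the extensibility argument made for S3 above.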
Function reference
Profiling
| Function | Purpose |
|---|---|
| infer_column_types(df) | Classify each column; character columns split into categorical vs text. |
| analyze_missing(df) | Per-column and overall missingness; complete-row count. |
| summarize_columns(df) | Numeric summary (mean, sd, variance, quartiles, IQR, skewness, kurtosis) and categorical cardinality / top level. |
| data_quality_score(df) | 0–100 score and letter grade from completeness, row uniqueness, column variability, and (optionally) outlier rate. |
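For readers unfamiliar with moment-based skewness and kurtosis, the sketch below shows one common definition in base R. It is an illustration under stated assumptions rather than the package's code: the function names are hypothetical, the moments are population (divide-by-n) moments, and kurtosis is reported in its non-excess form, where a normal distribution gives roughly 3. The exported skewness() and kurtosis() may use a different convention.

```r
# Hypothetical helpers; assumes population (divide-by-n) central moments.
central_moment <- function(x, k) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^k)
}

# Skewness: third central moment scaled by the second moment to the power 3/2.
skewness_sketch <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# Non-excess kurtosis: fourth central moment over the squared second moment.
kurtosis_sketch <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

skewness_sketch(c(1, 2, 3, 4, 5))   # symmetric data, so 0
kurtosis_sketch(c(1, 2, 3, 4, 5))   # 1.7
skewness_sketch(c(1, 1, 1, 10))     # long right tail, so positive
```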
Statistics
| Function | Purpose |
|---|---|
| normality_tests(df) | Shapiro–Wilk (and Anderson–Darling if nortest is present) per numeric column; large columns subsampled to 5000. |
| detect_outliers(x, method) | "iqr", "zscore", or "robust" (median/MAD) on a vector. |
| outlier_summary(df, method) | Per-column outlier counts and an overall rate. |
| correlation_analysis(df, method) | Pearson and/or Spearman matrices over numeric columns. |
| categorical_association(df) | Cramer’s V matrix between categorical columns. |
| analyze_dates(df) | Range, unique count, and largest gap for date/datetime columns. |
| compare_groups(df, group) | Numeric summaries within the levels of a grouping column. |
| skewness(x), kurtosis(x) | Moment-based, exported for direct use. |
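The three outlier rules named above differ mainly in how they estimate centre and spread. The sketch below re-implements them in base R to show that difference. It is not detect_outliers() itself: the function name is hypothetical, and the cut-offs (1.5 × IQR fences, |z| > 3, robust score > 3.5) are conventional defaults assumed for illustration, not values taken from the package.

```r
# Hypothetical re-implementation of the three rules; cut-offs are assumptions.
flag_outliers_sketch <- function(x, method = c("iqr", "zscore", "robust")) {
  method <- match.arg(method)
  switch(method,
    iqr = {
      # Tukey fences: 1.5 * IQR beyond the first and third quartiles
      q <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
      fence <- 1.5 * (q[2] - q[1])
      x < q[1] - fence | x > q[2] + fence
    },
    # Classical z-score: distance from the mean in standard deviations
    zscore = abs((x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE)) > 3,
    robust = {
      # mad() rescales by 1.4826 by default, making it comparable to sd
      # for normally distributed data
      abs(x - stats::median(x, na.rm = TRUE)) / stats::mad(x, na.rm = TRUE) > 3.5
    }
  )
}

x <- c(10, 11, 12, 11, 10, 12, 11, 95)
which(flag_outliers_sketch(x, "iqr"))     # 8
which(flag_outliers_sketch(x, "zscore"))  # integer(0)
which(flag_outliers_sketch(x, "robust"))  # 8
```

In this small sample the z-score rule misses the obvious outlier because the value 95 inflates the standard deviation it is measured against, whereas the quartile-based and median/MAD rules are unaffected by it. That masking effect is the usual reason for offering a robust option.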
Visualization (ggplot2)
| Function | Purpose |
|---|---|
| plot_missing(df) | Missing-value heatmap (rows subsampled when large). |
| plot_distribution(df, column) | Histogram + density (numeric) or bar chart (categorical). |
| plot_correlation(df, method) | Annotated correlation heatmap. |
| plot_association(df) | Cramer’s V heatmap for categorical columns. |
| plot_boxplots(df) | Faceted boxplots for the numeric columns. |
| plot_pairs(df, columns) | Scatterplot matrix for selected numeric columns. |
Pipeline, reporting & object
| Function | Purpose |
|---|---|
| profile_data(df, ...) | Run everything; return a data_profile. Options include group_by and distributions. |
| report(x, file) | Render the profile to a self-contained HTML file (needs pandoc). |
| print / summary / plot methods | Overview / detail / figures (plot() adds which = "association"). |
| is_data_profile(x) | Class predicate. |
The data_profile object
profile_data() returns an S3 list with four parts plus the call:
- metadata: dataset name, dimensions, per-column types, type counts, timestamp.
- statistics: numeric summary, categorical summary, correlation matrices, and the categorical association matrix.
- diagnostics: missingness, normality, outliers, date-column profile, the grouped comparison (when group_by is set), and the quality score.
- plots: the ggplot2 objects (empty if build_plots = FALSE; the per-column distribution plots are also skipped when distributions = FALSE).
Folder structure
dataProfilerR/
├── DESCRIPTION
├── NAMESPACE # generated by roxygen2
├── LICENSE
├── NEWS.md
├── R/
│ ├── dataProfilerR-package.R
│ ├── utils.R # validation + skewness/kurtosis
│ ├── profiling.R # types, missingness, summaries, quality score
│ ├── statistics.R # normality, outliers, correlation
│ ├── association.R # Cramer's V for categoricals
│ ├── dates.R # date/datetime profiling
│ ├── groups.R # grouped comparison
│ ├── visualization.R # ggplot2 functions
│ ├── report.R # HTML report (rmarkdown)
│ ├── profile_data.R # orchestrator + S3 constructor
│ └── methods.R # print / summary / plot
├── man/ # generated by roxygen2
├── tests/testthat/ # unit + edge-case tests
└── vignettes/dataProfilerR.Rmd
Testing
testthat (edition 3) covers each function plus edge cases — empty frames, wrong types, all-NA columns, single-column frames, missing-column plot requests, and output-shape consistency. Run with devtools::test().
Limitations and future improvements
Added in 0.2.0: report() (HTML), categorical_association() (Cramer’s V), analyze_dates(), compare_groups(), and a distributions = FALSE switch to avoid eager per-column plots on wide data. See NEWS.md.
Still open / honest gaps:
- Numeric-vs-categorical effect sizes (e.g. eta-squared, group-mean differences with tests) aren’t here yet; compare_groups() reports descriptive summaries only, not significance.
- Date analysis is shallow: range and gaps, but no seasonality/trend.
- Distribution plots are still eager unless you opt out with distributions = FALSE; a fully lazy, build-on-demand path would be cleaner.
- report() requires pandoc (the usual R Markdown dependency); there is no pandoc-free fallback.
- Text columns are detected but not analysed beyond cardinality.