icaoberg / Cleaning Up Data in R

Created Mon, 23 Dec 2024 23:26:30 -0500 Modified Tue, 26 Aug 2025 01:36:28 -0400

It all began in a computational statistics graduate class at Carnegie Mellon back in 2006. That was the year I was introduced to R—a language that seemed to hold the key to unlocking a deeper understanding of data and statistical modeling.

At the time, it wasn’t just about learning syntax; it was about immersing myself in a new way of thinking that merged computation and statistics in ways I hadn’t experienced before. R became a bridge between theory and practice, allowing me to test ideas quickly, visualize results, and explore datasets with curiosity and rigor.


R in Action

R has always been more than just a scripting language. In my early projects, every assignment felt like a puzzle, and every late-night coding session brought a small breakthrough: plotting a dataset, debugging a model, or uncovering the story hidden in numbers.

Fast-forward to today, and I still use R for exploratory work—especially when I need to wrangle and quickly understand messy data. One of my favorite workflows is pairing jsonlite for data ingestion with skimr for fast summaries.


Loading the Data

For example, let’s load a JSON file from the Brain Image Library inventory:

library(jsonlite)

url <- "https://download.brainimagelibrary.org/inventory/daily/reports/today.json"
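# flatten is an argument of fromJSON(), not as.data.frame():
# it expands nested JSON objects into ordinary columns
json_data <- fromJSON(url, flatten = TRUE)
df <- as.data.frame(json_data)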

Here, we flatten the JSON structure into a data.frame so it’s easier to analyze.
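In jsonlite, flatten is an option of fromJSON(). To see what it changes, here is a minimal sketch on a small inline JSON string. The field names (id, meta, size, files) are made up for illustration and are not taken from the Brain Image Library inventory.

```r
library(jsonlite)

# Hypothetical inline JSON standing in for a remote file;
# the field names are invented for this example.
txt <- '[
  {"id": 1, "meta": {"size": 10, "files": 2}},
  {"id": 2, "meta": {"size": 20, "files": 5}}
]'

nested <- fromJSON(txt)                  # "meta" stays a nested data.frame column
flat   <- fromJSON(txt, flatten = TRUE)  # nested fields become top-level columns

names(nested)  # "id" "meta"
names(flat)    # "id" "meta.size" "meta.files"
```

The flattened version is usually the easier one to summarize, since every column is a plain vector.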

Installing and Using skimr

Next, install and load the skimr package (I still call it the “Swiss army knife” for quick dataset overviews):

install.packages("skimr")  # only needed once per R installation
library(skimr)

Now, with a single command:

skim(df)

You instantly get a comprehensive summary of your dataset, including row and column counts, missing values, variable types, and distribution statistics.
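If you only care about one variable type, skimr can hand back just that part of the summary. The sketch below uses the built-in iris dataset rather than the inventory file, so it runs without network access. It assumes a skimr 2.x installation, where yank() is available.

```r
library(skimr)

# iris ships with R, so no download is needed for this example.
summary_tbl <- skim(iris)

# yank() extracts the sub-table for a single variable type.
numeric_tbl <- yank(summary_tbl, "numeric")

numeric_tbl$skim_variable  # names of the numeric columns
numeric_tbl$n_missing      # missing-value count per numeric column
```

Because the result is an ordinary table, it can be filtered or sorted like any other data frame.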

Example Output

Here’s a small excerpt from the skim(df) results on this dataset:

── Data Summary ────────────────────────
Name                       df    
Number of rows             7380  
Number of columns          27    

Column type frequency:           
  character                20    
  logical                  1     
  numeric                  6     

And for numeric columns:

── Variable type: numeric ───────────────────────────────────────────────────────
  skim_variable   n_missing complete_rate     mean       sd    p0          p25
1 size                   10         0.999 5.53e+11 5.03e+12 13606 431993323   
2 number_of_files         0         1     1.42e+ 4 1.66e+ 5     0         3   
3 md5_coverage            0         1     9.29e- 1 2.57e- 1     0         1   

With one glance, you can spot:

  • Missing values (n_missing)
  • Coverage and completeness of checksum calculations
  • File size distributions across datasets

This saves time compared to manually checking each column with summary(), str(), or custom scripts.
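For comparison, here is roughly what the manual route looks like in base R. This is only a sketch: the toy data frame borrows column names from the excerpt above, but the values are invented, and it covers just types and missingness, not the distribution statistics skim() also reports.

```r
# Toy data frame; column names echo the skim excerpt, values are invented.
toy <- data.frame(
  size            = c(13606, NA, 431993323, 5.0e9),
  number_of_files = c(0, 3, 12, 40),
  md5_coverage    = c(0, 1, 1, 0.5)
)

# One row per column: its type, how many values are missing,
# and the share of non-missing values.
manual_profile <- data.frame(
  variable      = names(toy),
  type          = vapply(toy, function(col) class(col)[1], character(1)),
  n_missing     = colSums(is.na(toy)),
  complete_rate = colMeans(!is.na(toy)),
  row.names     = NULL
)

manual_profile
```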

Why It Matters

Data cleaning is often the least glamorous but most critical step in analysis. With tools like skimr, I can quickly assess data quality, identify inconsistencies, and decide what requires deeper preprocessing before running models or sharing results.

In projects like the Brain Image Library, where inventories contain thousands of datasets and dozens of metadata fields, this kind of quick triage is invaluable.
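Once the summary points at a problem, the next step is usually to pull out the affected rows. A minimal sketch, assuming an inventory with size and md5_coverage columns like those in the excerpt; the dataset identifiers and all values here are hypothetical, and the rule itself (missing size or coverage below 1) is just one plausible choice.

```r
# Hypothetical inventory slice; identifiers and values are invented.
inventory <- data.frame(
  dataset      = c("a", "b", "c", "d"),
  size         = c(13606, NA, 431993323, 5.0e9),
  md5_coverage = c(1, 1, 0.5, 1),
  stringsAsFactors = FALSE
)

# Flag rows with a missing size or incomplete checksum coverage.
needs_review <- is.na(inventory$size) | inventory$md5_coverage < 1

inventory[needs_review, ]
```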

Takeaways

  • R has been part of my journey since 2006, and its ecosystem keeps growing stronger.
  • jsonlite + skimr is a lightweight workflow for quickly exploring structured data.
  • Data cleaning is never optional—but with the right tools, it doesn’t have to be painful.

If you’ve never tried skimr, I highly recommend adding it to your data science toolkit. It’s one of those packages you don’t know you need until you run it for the first time.