Documenting Datasets

library(qtkit)
library(fs)
library(tibble)
library(dplyr)
library(glue)
library(readr)

Introduction

Proper dataset documentation is crucial for reproducible research and effective data sharing. The {qtkit} package provides two main functions to standardize and automate the documentation process: create_data_origin(), which records where a dataset came from, and create_data_dictionary(), which describes its variables.

Creating Dataset Origin Documentation

Basic Usage

Let’s start with documenting the built-in mtcars dataset:

# Create a temporary file for our documentation
origin_file <- file_temp(ext = "csv")

# Create the origin documentation template
origin_doc <- create_data_origin(
  file_path = origin_file,
  return = TRUE
)
#> Data origin file created at `file_path`.

# View the template
origin_doc |>
  glimpse()
#> Rows: 8
#> Columns: 2
#> $ attribute   <chr> "Resource name", "Data source", "Data sampling frame", "Da…
#> $ description <chr> "The name of the resource.", "URL, DOI, etc.", "Language, …

The template provides fields for essential metadata. You can either open the CSV file in a spreadsheet editor or fill it out programmatically, as shown below.

Here’s how you might fill it out for mtcars:

origin_doc |>
  mutate(description = c(
    "Motor Trend Car Road Tests",
    "Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.",
    "US automobile market, passenger vehicles",
    "1973-74",
    "Built-in R dataset (.rda)",
    "Single data frame with 32 observations of 11 variables",
    "Public Domain",
    "Citation: Henderson and Velleman (1981)"
  )) |>
  write_csv(origin_file)
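
After writing the descriptions, it is worth reading the file back to confirm the contents are what you intended. A minimal check with {readr}, using the origin_file path created above:

```r
# Read the completed documentation back in to verify
origin_check <- read_csv(origin_file, show_col_types = FALSE)
origin_check
```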

Customizing Origin Documentation

If the documentation file already exists, set force = TRUE to overwrite it:

create_data_origin(
  file_path = origin_file,
  force = TRUE
)
#> Data origin file created at `file_path`.
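
Before forcing a regeneration, you may want to check whether a documentation file is already present. A small guard using {fs} (already loaded above):

```r
# Only warn about overwriting if the file already exists
if (file_exists(origin_file)) {
  message("Origin file exists; `force = TRUE` will overwrite it.")
}
```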

Creating Data Dictionaries

Basic Dictionary Creation

Create a basic data dictionary without AI assistance:

# Create a temporary file for our dictionary
dict_file <- file_temp(ext = "csv")

# Generate dictionary for iris dataset
iris_dict <- create_data_dictionary(
  data = iris,
  file_path = dict_file
)

# View the results
iris_dict |>
  glimpse()
#> Rows: 5
#> Columns: 4
#> $ variable    <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Widt…
#> $ name        <chr> NA, NA, NA, NA, NA
#> $ type        <chr> "numeric", "numeric", "numeric", "numeric", "factor"
#> $ description <chr> NA, NA, NA, NA, NA
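
Without AI assistance, the name and description columns are returned as NA for you to fill in. One way to do so programmatically, mirroring the origin workflow above (the descriptions here are written by hand):

```r
iris_dict |>
  mutate(
    name = c("Sepal Length", "Sepal Width", "Petal Length",
             "Petal Width", "Species"),
    description = c(
      "Length of the sepal in centimeters",
      "Width of the sepal in centimeters",
      "Length of the petal in centimeters",
      "Width of the petal in centimeters",
      "Iris species: setosa, versicolor, or virginica"
    )
  ) |>
  write_csv(dict_file)
```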

AI-Enhanced Data Dictionaries

If you have an OpenAI API key, you can generate more detailed descriptions:

# Not run - requires an OpenAI API key
# Prefer storing the key in .Renviron rather than hard-coding it in scripts
Sys.setenv(OPENAI_API_KEY = "your-api-key")

iris_dict_ai <- create_data_dictionary(
  data = iris,
  file_path = dict_file,
  model = "gpt-4",
  sample_n = 5
)

Example output might look like:

#> # A tibble: 2 × 4
#>   variable     name         type    description                       
#>   <chr>        <chr>        <chr>   <chr>                             
#> 1 Sepal.Length Sepal Length numeric Length of the sepal in centimeters
#> 2 Sepal.Width  Sepal Width  numeric Width of the sepal in centimeters

Working with Larger Datasets

For larger datasets, you can use sampling and grouping:

# The diamonds dataset ships with {ggplot2}
library(ggplot2)

diamonds_dict <- diamonds |>
  create_data_dictionary(
    file_path = "diamonds_dict.csv",
    model = "gpt-4",
    sample_n = 3,
    grouping = "cut" # Sample across different cut categories
  )

Best Practices

  1. Create documentation when first obtaining/creating a dataset
  2. Update documentation when:
    • Adding new variables
    • Modifying data structure
    • Changing data sources
  3. Store documentation alongside data in version control
  4. Include documentation paths in your project README
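
The practices above can be sketched as a project layout in which documentation files sit beside the data they describe, using {fs} (the paths are illustrative):

```r
# Illustrative layout: documentation lives next to the dataset it documents
proj <- path(path_temp(), "my-project")
dir_create(path(proj, "data"))
file_touch(path(proj, "data",
                c("mtcars.csv", "mtcars_origin.csv", "mtcars_dictionary.csv")))
dir_tree(proj)
```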

Conclusion

The {qtkit} package provides flexible tools for standardizing dataset documentation. By combining create_data_origin() and create_data_dictionary(), you can create comprehensive documentation that enhances reproducibility and data sharing.

Additional Resources