NeuroAgent

How to Create Excellent Reproducible Examples in R

Learn how to create excellent reproducible examples in R with best practices for data representation, code structure, and environment information. Master the art of sharing reproducible R code effectively.


10/25/2025, 12:02 PM

How to Create an Excellent Reproducible Example in R

When discussing performance with colleagues, teaching, sending a bug report, or searching for guidance on mailing lists and Stack Overflow, a reproducible example is often requested and always helpful.

Key Questions:

What are the best practices for creating an excellent reproducible example in R?
How should you paste data structures from R in a text format?
What additional information should be included in a reproducible example?

Technical Considerations:

What are alternative methods to dput(), dump(), or structure() for sharing R code and data?
When should you include library() or require() statements in your example?
Which reserved words should be avoided beyond c, df, data, etc.?

Main Question:

How does one create a great reproducible example in R that effectively communicates the issue or concept to others?

NeuroAgent

Creating an excellent reproducible example in R requires careful attention to data structure representation, code context, and clear problem description. The best approach combines minimal code that reproduces the issue with properly formatted data and necessary environment information, ensuring others can run your code exactly as you intended without missing dependencies or encountering different results.

What Makes a Reproducible Example Excellent?
Data Representation Techniques
Code Structure and Context
Essential Environment Information
Alternative Data Sharing Methods
Common Pitfalls and Best Practices
Putting It All Together: A Complete Example

What Makes a Reproducible Example Excellent?

An excellent reproducible example in R serves as a self-contained problem solver that others can run exactly as written to understand and address your issue. The key characteristics include:

Minimalism: The example should contain only the code necessary to reproduce the problem, eliminating unnecessary complexity and dependencies. This respects others’ time and makes debugging more efficient.

Clarity: The code should be well-structured with clear variable names and comments explaining the expected behavior versus the actual behavior. When someone reads your example, they should immediately understand what you’re trying to accomplish.

Portability: The example should work across different R environments without requiring special setup, unique data files, or specific R versions unless those are central to your issue.

Context: While minimal, the example should provide enough context about the problem domain and your intended approach to help others understand the broader implications of the issue.

A great reproducible example is like a scientific experiment - it must be repeatable under the same conditions to yield consistent results.

Data Representation Techniques

Properly representing data is crucial for reproducibility. When sharing data structures in text format, you have several effective methods at your disposal:

Using `dput()` for Small to Medium Data

The dput() function is your go-to tool for creating reproducible representations of R objects:

# Create a sample data frame (fixed values so the output below is exact)
df <- data.frame(
  x = 1:5,
  y = letters[1:5],
  z = c(0.123, -0.456, 0.789, -0.234, 0.567)
)

# Use dput to create a text representation
dput(df)
# Output (R >= 4.0, where strings are no longer converted to factors;
# line breaks may differ on your console):
# structure(list(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0.123, 
# -0.456, 0.789, -0.234, 0.567)), class = "data.frame", row.names = c(NA, 
# -5L))

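When attributes and exact numeric values matter, pass control = "all" so that dput() deparses the object as faithfully as it can. This does not make the output shorter; for larger datasets, paste a subset instead, for example dput(head(df, 20)).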

dput(df, control = "all")
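One way to check that pasted dput() output really recreates your object is to round-trip it before posting. The sketch below is illustrative only: the names original, pasted_text, and recreated are made up for this example, and capture.output() stands in for copying the text by hand.

```r
# A small object to share (hypothetical example data)
original <- data.frame(
  id = 1:3,
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Capture what dput() would print, as if it had been pasted into a post
pasted_text <- paste(capture.output(dput(original)), collapse = "\n")

# Evaluate the pasted text, as a reader of the post would
recreated <- eval(parse(text = pasted_text))

# Compare the rebuilt object with the original
round_trip_ok <- isTRUE(all.equal(original, recreated))
print(round_trip_ok)
```

If the comparison fails, the object probably carries something that does not deparse cleanly (for example an environment or an external pointer), and a simpler representation is needed.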

Using `dump()` for Multiple Objects

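When you need to share multiple related objects, dump() is more convenient because it writes assignments (name <- value) that recreate each object when the output is sourced. It works best for plain data; fitted models such as lm objects contain calls and environments that may not round-trip cleanly, so share the data and the modeling call instead.

# Create several objects
data1 <- matrix(1:9, nrow = 3)
data2 <- list(a = 1:3, b = "test")

# Dump them to a file that others can source()
dump(c("data1", "data2"), file = "mydata.R")

# Or print the definitions to the console for pasting
dump(c("data1", "data2"), file = "")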
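To see what a reader does with a dumped file, the following sketch writes two objects with dump() and restores them with source() into a separate environment. The object names and the use of tempfile() are assumptions chosen to keep the example from touching your working directory.

```r
# Two plain objects to share (hypothetical names)
scores <- c(a = 1.5, b = 2.5)
lookup <- list(id = 1:3, label = "test")

# Write their definitions to a temporary file
dump_file <- tempfile(fileext = ".R")
dump(c("scores", "lookup"), file = dump_file)

# Restore them into a clean environment, as a reader would with source()
restored <- new.env()
source(dump_file, local = restored)

# The restored copies should match the originals
print(all.equal(restored$scores, scores))
print(all.equal(restored$lookup, lookup))
```

Sourcing into a separate environment is only for the check; someone running your example would normally source the file directly.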

Using `structure()` for Direct Construction

For very simple objects, you can call structure() directly in your code. A factor is stored as integer codes plus levels, so build it with factor() rather than attaching labels to a character vector:

my_df <- structure(list(
  x = 1:5,
  y = factor(c("a", "b", "c", "d", "e")),
  z = c(0.123, -0.456, 0.789, -0.234, 0.567)
), row.names = c(NA, -5L), class = "data.frame")

Handling Large Datasets

For datasets that are too large for direct pasting:

Use built-in datasets: Reference R’s built-in datasets like mtcars, iris, or airquality
Create synthetic data: Generate similar data with the same structure and characteristics
Share a subset: Use head() or sample() to create a representative subset
Use external files: Provide a link to a publicly accessible file

# Example using built-in dataset
library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Example creating synthetic data
set.seed(123)  # Important for reproducibility!
x <- rnorm(100)  # Define x first so that y can depend on it
synthetic_data <- data.frame(
  x = x,
  y = rnorm(100, mean = 2 * x + 1),
  group = sample(c("A", "B", "C"), 100, replace = TRUE)
)
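The "share a subset" option can be sketched with built-in data. The column choice, the sample size of 5, and the seed value are arbitrary assumptions; droplevels() is used so that unused factor levels do not bloat the dput() output.

```r
# Take a small random sample of rows and only the columns that matter
set.seed(42)  # so everyone gets the same rows
sampled_rows <- mtcars[sample(nrow(mtcars), 5), c("mpg", "wt", "cyl")]
dput(sampled_rows)

# For data with factors, drop unused levels before calling dput()
iris_subset <- droplevels(head(iris, 10))
dput(iris_subset)
```

Before posting a subset, check that it still reproduces the problem; a sample that is too small may hide it.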

Code Structure and Context

Including Library Statements

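Always include library() calls for every package your code depends on. Prefer library() over require() in examples: library() stops with an error when a package is missing, whereas require() only returns FALSE with a warning, so the failure shows up later and less clearly.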

# Good practice
library(ggplot2)
library(dplyr)

# Your code here

If you only call a few functions, you can skip library() and use the package::function form, which makes each dependency explicit:

# Fine for minimal examples
library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Alternative - namespace-qualified calls, no library() needed
ggplot2::ggplot(mtcars, ggplot2::aes(wt, mpg)) + ggplot2::geom_point()
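The difference between library() and require() is easiest to see with a package that is not installed. In this sketch the package name notARealPackage12345 is deliberately made up, and the warning from require() is suppressed only to keep the output short.

```r
# A package name that is assumed not to exist on any machine
missing_pkg <- "notARealPackage12345"

# require() does not stop: it returns FALSE (with a warning)
require_result <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)

# library() raises an error, which we catch here to inspect it
library_result <- tryCatch(
  {
    library(missing_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)

print(require_result)
print(library_result)
```

In a reproducible example the early, loud failure from library() is usually what you want, because readers see at once which package they are missing.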

Variable Naming Conventions

Use clear, descriptive variable names that explain the data’s purpose:

# Good
customer_data <- read.csv("customers.csv")
analysis_results <- lm(spend ~ age + income, data = customer_data)

# Avoid - unclear names
df <- read.csv("customers.csv")
res <- lm(y ~ x1 + x2, data = df)

Code Organization

Structure your code with clear sections:

# 1. Load required packages
library(dplyr)

# 2. Create or load data
my_data <- data.frame(
  category = c("A", "B", "A", "C"),
  value = c(10, 20, 15, 25)
)

# 3. Reproduce the issue
# Expected: one row per category (A = 25, B = 20, C = 25)
# Actual: paste the output you get here and say how it differs
result <- my_data %>%
  group_by(category) %>%
  summarize(total = sum(value))
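Expected and actual results can also be stated as code rather than only in comments, so readers can run the comparison themselves. This sketch uses base R instead of dplyr to stay dependency-free; the expected values are simply the per-category sums of the small data frame above, and the names expected and actual are illustrative.

```r
# Same data as in the organized example
my_data <- data.frame(
  category = c("A", "B", "A", "C"),
  value = c(10, 20, 15, 25)
)

# What you expect to get, written out explicitly
expected <- c(A = 25, B = 20, C = 25)

# What the code under discussion actually returns
actual <- vapply(split(my_data$value, my_data$category), sum, numeric(1))

# TRUE when they agree; otherwise a description of the difference
comparison <- all.equal(expected, actual)
print(comparison)
```

When the comparison is not TRUE, the text that all.equal() returns is often a useful line to paste into the question.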

Reserved Words to Avoid

Names such as c, df, and data are not reserved words; they are existing functions, and reusing them as variable names is legal but confusing. Two groups of names are worth avoiding:

Reserved words, which cannot be assigned to at all: if, else, for, while, repeat, function, break, next, in, TRUE, FALSE, NULL, NA, NaN, Inf (see ?Reserved)
Common function names, which are easy to mask by accident: c, df, data, library, require, source, sum, mean, sd, var, lm, glm

# Bad
if <- c(1, 2, 3)  # Syntax error: `if` is a reserved word
mean <- function(x) sum(x)/length(x)  # Legal, but masks base::mean()

# Good - using descriptive names
condition_vector <- c(1, 2, 3)
custom_mean <- function(x) sum(x)/length(x)
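The effect of masking, and how to undo it, can be shown in a few lines. This is a sketch: the replacement function and the candidate names are invented for illustration, and the make.names() comparison is only a rough check, since it also flags names that are invalid for other reasons.

```r
# Masking: a user-defined `mean` hides base::mean() in this session
mean <- function(x) "masked"
masked_result <- mean(c(1, 2, 3))

# Removing the user-defined version makes base::mean() visible again
rm(mean)
restored_result <- mean(c(1, 2, 3))

print(masked_result)
print(restored_result)

# Rough check for reserved words: make.names() alters them
candidate_names <- c("if", "function", "total", "customer_data")
needs_renaming <- make.names(candidate_names) != candidate_names
print(setNames(needs_renaming, candidate_names))
```

Readers who paste an example that masks a common function carry the masked version into the rest of their session, which is one more reason to pick distinct names.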

Essential Environment Information

R Session Information

Include your R version and package versions when relevant:

R.version.string
# [1] "R version 4.3.1 (2023-06-16 ucrt)"

sessionInfo()
# Illustrative, abridged output; yours will differ
# R version 4.3.1 (2023-06-16 ucrt)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 19045)
#
# locale:
# [1] LC_COLLATE=English_United States.utf8
# [2] LC_CTYPE=English_United States.utf8
# [3] LC_MONETARY=English_United States.utf8
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United States.utf8
#
# other attached packages:
# [1] ggplot2_3.4.3 dplyr_1.1.3   tidyr_1.3.0
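When the full sessionInfo() output is too long for a post, a few lines are often enough. The sketch below collects them with base R only; the vector name env_summary and the choice of fields are assumptions, and the stats package stands in for whichever package your issue involves.

```r
# Collect the essentials into one named character vector
env_summary <- c(
  r_version = R.version.string,
  platform = R.version$platform,
  os = Sys.info()[["sysname"]],
  stats_version = as.character(packageVersion("stats"))
)

# Print as "name: value" lines that paste cleanly into a post
writeLines(paste0(names(env_summary), ": ", env_summary))
```

Swap "stats" for the packages that matter to your question, one packageVersion() call per package.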

Set Random Seeds

When your code involves random processes, set the random seed so that others get the same values:

# Set the seed once, before the first random operation
set.seed(123)
random_data <- rnorm(10)

# Later random calls continue the same stream, so one seed is enough
training_indices <- sample(1:100, 70)
# Take the remaining rows as the test set so the two sets cannot overlap
test_indices <- setdiff(1:100, training_indices)
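A short check shows what a seed does and does not guarantee. The seed values and variable names here are arbitrary. One caveat worth knowing: the default algorithm behind sample() changed in R 3.6.0, so the same seed can give different sample() results on older versions, which is another reason to report your R version.

```r
# Same seed, same draws
set.seed(123)
first_draw <- rnorm(5)
set.seed(123)
second_draw <- rnorm(5)
same_seed_matches <- identical(first_draw, second_draw)

# Different seed, different draws
set.seed(456)
third_draw <- rnorm(5)
different_seed_matches <- identical(first_draw, third_draw)

print(same_seed_matches)
print(different_seed_matches)
```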

Operating System Specifics

Mention if your issue is platform-specific:

# This issue occurs on Windows but not macOS
# R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
# Running under: Windows 10 x64 (build 19045)

Alternative Data Sharing Methods

JSON and CSV Formats

For tabular data, CSV or JSON text can be more readable than dput() output:

# CSV format
data_text <- "x,y,z
1,a,0.123
2,b,-0.456
3,c,0.789
4,d,-0.234
5,e,0.567"

# Read from text
df <- read.csv(textConnection(data_text))
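Both read.csv() and read.table() also accept the text directly through their text argument, which avoids opening a connection by hand. The sketch below reads the same small table in two layouts; the variable names are illustrative, and the whitespace-aligned layout is the kind of table people often paste from console output.

```r
# Comma-separated text
csv_text <- "x,y,z
1,a,0.123
2,b,-0.456
3,c,0.789"
from_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Whitespace-aligned text, as copied from printed output
table_text <- "
x y      z
1 a  0.123
2 b -0.456
3 c  0.789
"
from_table <- read.table(text = table_text, header = TRUE,
                         stringsAsFactors = FALSE)

# Both routes should give the same data frame
print(all.equal(from_csv, from_table))
```

Text input like this loses type information such as factor levels or date classes, so prefer dput() when those details are part of the problem.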

Custom Functions for Complex Objects

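For complex objects, create functions to reconstruct them, and set the seed inside the function so that every call returns the same data:

# Instead of pasting a large dput output
create_test_data <- function(seed = 123) {
  set.seed(seed)  # Fix the random stream so the data are reproducible
  data.frame(
    id = 1:100,
    value = rnorm(100),
    category = sample(c("A", "B", "C"), 100, replace = TRUE),
    stringsAsFactors = FALSE
  )
}

# Use in example
test_data <- create_test_data()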

Package Development Practices

For long-term reproducibility, consider package development practices:

# Use usethis for package structure
usethis::create_package("myreproducibleexample")

# Use usethis to set up version control
# (these helpers moved from devtools to usethis)
usethis::use_git()
usethis::use_github()

Common Pitfalls and Best Practices

Common Mistakes to Avoid

Including unnecessary code: Strip out everything not needed to reproduce the issue
Using absolute file paths: Always use relative paths or create data in code
Forgetting to set random seeds: Random processes yield different results
Using local datasets: Either create synthetic data or use built-in datasets
Missing package dependencies: Always include library() statements
Overwriting built-in functions: Avoid reusing the names of common functions (mean, c, data) as variable names

Best Practices Checklist

[ ] Run your example from a fresh R session to ensure it works
[ ] Remove all warnings and messages unless they’re relevant to the issue
[ ] Test your example on a different machine if possible
[ ] Include expected vs. actual output
[ ] Use meaningful variable names
[ ] Keep it as simple as possible while still reproducing the issue
[ ] Add comments explaining what each section does
[ ] Include a clear description of the problem
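The first checklist item can be automated by running the example in a separate R process started with --vanilla, which ignores your saved workspace and startup files. This is a sketch under assumptions: the two-line script is a stand-in for your real example, and it relies on the Rscript executable that ships with R being found under R.home("bin").

```r
# Write the example to a temporary script (stand-in content)
example_file <- tempfile(fileext = ".R")
writeLines(
  c("values <- c(10, 20, 15, 25)",
    "print(sum(values))"),
  example_file
)

# Run it in a fresh process with no saved workspace or profile files
rscript_path <- file.path(R.home("bin"), "Rscript")
fresh_output <- system2(
  rscript_path,
  args = c("--vanilla", shQuote(example_file)),
  stdout = TRUE,
  stderr = TRUE
)

# Anything the script prints, including errors, ends up here
print(fresh_output)
```

If the script fails here but works in your interactive session, it probably depends on an object or a loaded package that exists only in that session.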

Performance Considerations

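For performance-related questions, include system information and build the test data outside the timed expressions so that only the operation itself is measured:

# System information
Sys.info()

# Benchmark code
library(microbenchmark)
values <- as.numeric(1:1000000)  # Doubles avoid integer overflow in sum()
values_tbl <- dplyr::tibble(x = values)
microbenchmark(
  base_approach = sum(values),
  dplyr_approach = dplyr::summarise(values_tbl, total = sum(x))
)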

Putting It All Together: A Complete Example

Here’s a template for an excellent reproducible example:

# Title: Grouped summary with dplyr returns fewer groups than expected
#        (template: replace the title and notes with your own issue)

# Environment information
R.version.string
# [1] "R version 4.3.1 (2023-06-16)"
packageVersion("dplyr")
# [1] '1.1.3'

# Load required packages
library(dplyr)

# Create reproducible data
set.seed(123)  # For reproducibility
test_data <- data.frame(
  id = 1:10,
  category = sample(c("A", "B", "C"), 10, replace = TRUE),
  value = rnorm(10),
  stringsAsFactors = FALSE  # Default since R 4.0; explicit for older versions
)

# Expected behavior: one row per category, with the summed values
# Actual behavior: describe what you observe, e.g. "category 'C' is missing"
# (This placeholder data does not trigger a real bug; it only shows the layout.)

# Issue reproduction
result <- test_data %>%
  group_by(category) %>%
  summarise(total_value = sum(value))

# Show the input and the output so readers can compare them
print(test_data)
print(result)

# Additional context: note anything that narrows the problem down,
# such as the package version where the behavior first appeared

This example follows all best practices:

Clear title and description
Environment information
Required packages loaded
Reproducible data creation
Expected vs. actual behavior explained
Minimal code focused on the issue
Comments explaining each step

Conclusion

Creating an excellent reproducible example in R is both an art and a science that requires attention to detail and consideration for your audience. The key takeaways include:

Minimize ruthlessly: Include only the code necessary to reproduce your issue, removing all unnecessary complexity and dependencies.
Represent data properly: Use dput() for small datasets, create synthetic data for larger ones, and always set random seeds when working with random processes.
Provide context: Include library statements, R session information, and clear descriptions of expected vs. actual behavior.
Follow naming conventions: Use descriptive variable names, and avoid reusing the names of built-in functions so that your example does not mask them.
Test thoroughly: Run your example from a fresh R session to ensure it works as intended before sharing it with others.

By following these practices, you’ll create reproducible examples that help others understand and solve your problems more efficiently, whether you’re reporting bugs, asking for help, or sharing knowledge with colleagues.

Remember that a well-crafted reproducible example not only solves your immediate problem but also serves as a learning resource for the entire R community. The time you invest in creating a clear, reproducible example will be rewarded with faster, more accurate, and more helpful responses from others.
