How to Create an Excellent Reproducible Example in R
When discussing performance with colleagues, teaching, sending a bug report, or searching for guidance on mailing lists and Stack Overflow, a reproducible example is often requested and always helpful.
Key Questions:
- What are the best practices for creating an excellent reproducible example in R?
- How should you paste data structures from R in a text format?
- What additional information should be included in a reproducible example?
Technical Considerations:
- What are alternative methods to dput(), dump(), or structure() for sharing R code and data?
- When should you include library() or require() statements in your example?
- Which reserved words should be avoided beyond c, df, data, etc.?
Main Question:
How does one create a great reproducible example in R that effectively communicates the issue or concept to others?
Creating an excellent reproducible example in R requires careful attention to data structure representation, code context, and clear problem description. The best approach combines minimal code that reproduces the issue with properly formatted data and necessary environment information, ensuring others can run your code exactly as you intended without missing dependencies or encountering different results.
Contents
- What Makes a Reproducible Example Excellent?
- Data Representation Techniques
- Code Structure and Context
- Essential Environment Information
- Alternative Data Sharing Methods
- Common Pitfalls and Best Practices
- Putting It All Together: A Complete Example
What Makes a Reproducible Example Excellent?
An excellent reproducible example in R is a self-contained demonstration of a problem that others can run exactly as written to see and diagnose your issue. The key characteristics include:
Minimalism: The example should contain only the code necessary to reproduce the problem, eliminating unnecessary complexity and dependencies. This respects others’ time and makes debugging more efficient.
Clarity: The code should be well-structured with clear variable names and comments explaining the expected behavior versus the actual behavior. When someone reads your example, they should immediately understand what you’re trying to accomplish.
Portability: The example should work across different R environments without requiring special setup, unique data files, or specific R versions unless those are central to your issue.
Context: While minimal, the example should provide enough context about the problem domain and your intended approach to help others understand the broader implications of the issue.
A great reproducible example is like a scientific experiment - it must be repeatable under the same conditions to yield consistent results.
Data Representation Techniques
Properly representing data is crucial for reproducibility. When sharing data structures in text format, you have several effective methods at your disposal:
Using dput() for Small to Medium Data
The dput() function is your go-to tool for creating reproducible representations of R objects:
# Create a sample data frame
df <- data.frame(
x = 1:5,
y = letters[1:5],
z = rnorm(5)
)
# Use dput to create a text representation
dput(df)
# Output (R >= 4.0, where strings stay character; the z values depend on the random draw):
# structure(list(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0.123,
# -0.456, 0.789, -0.234, 0.567)), class = "data.frame", row.names = c(NA,
# -5L))
The default dput() output already keeps attributes. When a closer representation matters, for example more digits for numeric values, use the control = "all" option:
dput(df, control = "all")
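Before pasting dput() output into a question, it can help to confirm that the text really rebuilds the object. The following is a minimal sketch of that check; the object name original and its contents are illustrative assumptions, not part of the example above.

```r
# Hypothetical example object (names and values are illustrative)
original <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Capture the dput() text exactly as it would be pasted into a post
dput_text <- paste(capture.output(dput(original)), collapse = "\n")

# Rebuild the object from that text, as a reader would after copying it
rebuilt <- eval(parse(text = dput_text))

# TRUE means the text representation round-trips without loss
identical(original, rebuilt)
```

If the comparison is not TRUE, the object probably carries something that plain text cannot express, such as an environment or an external pointer.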
Using dump() for Multiple Objects
When you need to share multiple related objects, dump() is more convenient because it writes each object as an assignment that source() can replay:
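# Create several objects
data1 <- matrix(1:9, nrow = 3)
data2 <- list(a = 1:3, b = "test")
# Dump them to a file (readers recreate them with source("mydata.R"))
dump(c("data1", "data2"), file = "mydata.R")
# Or to the console
dump(c("data1", "data2"), "")
# Note: fitted models such as lm() results contain environments and do not
# round-trip through dump(); share the data and the model call instead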
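To see what a reader does with dumped code, the sketch below writes two objects to a temporary file and reads them back with source(). The object names and the use of a separate environment to stand in for the reader's session are assumptions made for illustration.

```r
# Hypothetical objects to share
scores <- c(math = 90, art = 85)
settings <- list(alpha = 0.05, method = "holm")

# Write both objects as R code to a temporary file
dump_file <- tempfile(fileext = ".R")
dump(c("scores", "settings"), file = dump_file)

# A reader would source() the file; a fresh environment stands in for their session
reader_env <- new.env()
source(dump_file, local = reader_env)

# Both objects come back unchanged
identical(reader_env$scores, scores)
identical(reader_env$settings, settings)
```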
Using structure() for Direct Construction
For very simple objects, you can use structure() directly in your code:
my_df <- structure(list(
  x = 1:5,
  y = factor(c("a", "b", "c", "d", "e")), # a factor needs integer codes, so build it with factor()
  z = c(0.123, -0.456, 0.789, -0.234, 0.567)
), row.names = c(NA, -5L), class = "data.frame")
Handling Large Datasets
For datasets that are too large for direct pasting:
- Use built-in datasets: Reference R’s built-in datasets like mtcars, iris, or airquality
- Create synthetic data: Generate similar data with the same structure and characteristics
- Share a subset: Use head() or sample() to create a representative subset
- Use external files: Provide a link to a publicly accessible file
# Example using built-in dataset
library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + geom_point()
# Example creating synthetic data
set.seed(123) # Important for reproducibility!
x_values <- rnorm(100) # built first because y depends on it
synthetic_data <- data.frame(
  x = x_values,
  y = rnorm(100, mean = 2 * x_values + 1),
  group = sample(c("A", "B", "C"), 100, replace = TRUE)
)
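The "share a subset" option from the list above can be sketched as follows. The choice of columns and the seed value are arbitrary assumptions; the point is only that a few rows and columns of a large table are usually enough to paste.

```r
# Keep only the columns and rows needed to show the problem
small_cars <- head(mtcars[, c("mpg", "wt", "cyl")], 5)

# Print a pasteable representation of the subset
dput(small_cars)

# For a random rather than leading subset, fix the seed first
set.seed(42)
random_rows <- mtcars[sample(nrow(mtcars), 5), c("mpg", "wt", "cyl")]
```

Check that the problem still appears with the reduced data before posting it.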
Code Structure and Context
Including Library Statements
Always include library or require statements for packages your code depends on:
# Good practice
library(ggplot2)
library(dplyr)
# Your code here
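If you only call a few functions from a package, you can skip library() and qualify each call with its package name, which shows readers exactly where every function comes from:
# Attaching the package keeps the code short
library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + geom_point()
# Qualifying each call makes the dependency explicit without attaching anything
ggplot2::ggplot(mtcars, ggplot2::aes(wt, mpg)) + ggplot2::geom_point()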
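A reader may not have every package installed, so an example can state its dependencies up front and report what is missing. This is a rough sketch; the package list is a placeholder (stats and utils ship with R) that you would replace with the packages your example actually uses.

```r
# Packages the example depends on (illustrative list)
needed <- c("stats", "utils")

# requireNamespace() returns FALSE instead of stopping when a package is absent
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Tell the reader what to install instead of failing halfway through
missing_pkgs <- needed[!installed]
if (length(missing_pkgs) > 0) {
  message("Install before running: ", paste(missing_pkgs, collapse = ", "))
}
```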
Variable Naming Conventions
Use clear, descriptive variable names that explain the data’s purpose:
# Good
customer_data <- read.csv("customers.csv")
analysis_results <- lm(spend ~ age + income, data = customer_data)
# Avoid - unclear names
df <- read.csv("customers.csv")
res <- lm(y ~ x1 + x2, data = df)
Code Organization
Structure your code with clear sections:
# 1. Load required packages
library(dplyr)
# 2. Create or load data
my_data <- data.frame(
category = c("A", "B", "A", "C"),
value = c(10, 20, 15, 25)
)
# 3. Reproduce the issue (the two lines below are placeholders for your own description)
# Expected: one row per category with the summed value
# Actual: describe the output you actually get
result <- my_data %>%
group_by(category) %>%
summarize(total = sum(value))
Reserved Words to Avoid
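Names such as c, df, and data are not reserved, but they mask existing functions and make examples harder to follow. Avoid two groups of names:
- Reserved words, which R will not let you assign to: if, else, for, while, repeat, function, break, next, in, TRUE, FALSE, NULL, NA, NaN, Inf
- Names of common functions, which can be assigned but then mask the original: library, require, source, data, sum, mean, sd, var, lm, glm
# Bad - reserved words and existing function names
# if <- c(1, 2, 3) # Syntax error: 'if' is a reserved word
mean <- function(x) sum(x)/length(x) # Masks the built-in mean()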
# Good - using descriptive names
condition_vector <- c(1, 2, 3)
custom_mean <- function(x) sum(x)/length(x)
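One way to check a name before using it is to ask R whether a function with that name is already visible. The sketch below assumes a default session with the standard packages attached; the candidate names are illustrative, and customer_data is assumed not to exist as a function.

```r
# Candidate variable names (illustrative)
candidates <- c("df", "data", "mean", "customer_data")

# TRUE where a function of that name is already visible on the search path
masks_function <- vapply(candidates, exists, logical(1), mode = "function")
masks_function

# Reserved words cannot be assigned to at all; make.names() shows how R alters them
make.names(c("if", "TRUE", "customer_data"))
```

If you have already masked a function by accident, rm() on the offending name restores access to the original.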
Essential Environment Information
R Session Information
Include your R version and package versions when relevant:
R.version.string
# [1] "R version 4.3.1 (2023-06-16 ucrt)"
sessionInfo()
# Abbreviated output:
# R version 4.3.1 (2023-06-16 ucrt)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 19045)
# locale:
# LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
# LC_MONETARY=English_United States.1252 LC_NUMERIC=C
# LC_TIME=English_United States.1252
# other attached packages:
# ggplot2_3.4.3 dplyr_1.1.3 tidyr_1.3.0
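When the full sessionInfo() output is longer than the question needs, a few lines are often enough. The following sketch gathers them with base R only; the field names in env_summary are my own labels, and stats stands in for whichever package is relevant to your issue.

```r
# Collect the pieces of environment information most often asked for
env_summary <- c(
  r_version = R.version.string,
  platform  = R.version$platform,
  os        = Sys.info()[["sysname"]],
  stats_pkg = as.character(packageVersion("stats"))
)

# Print one "name: value" line per item, ready to paste under the example
cat(paste0(names(env_summary), ": ", env_summary), sep = "\n")
```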
Set Random Seeds
When your code involves random processes, set the random seed for reproducibility:
# Set seed before random operations
set.seed(123)
random_data <- rnorm(10)
# One seed at the top is enough; later draws continue the same random stream
training_indices <- sample(1:100, 70)
# Take the test set from the remaining rows so the two sets do not overlap
test_indices <- setdiff(1:100, training_indices)
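The effect of a seed is easy to verify directly. The short sketch below shows that resetting the seed replays the same numbers, while continuing without a reset does not. Note that R 3.6.0 changed the default algorithm behind sample(), so a seed reproduces identical values only across R versions with the same generator settings, which is one more reason to report your R version.

```r
set.seed(123)
first_draw <- rnorm(3)

# Resetting the seed replays the same sequence
set.seed(123)
second_draw <- rnorm(3)
identical(first_draw, second_draw)

# Without a reset the stream simply continues, so these values differ
third_draw <- rnorm(3)
identical(second_draw, third_draw)
```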
Operating System Specifics
Mention if your issue is platform-specific:
# This issue occurs on Windows but not macOS
# R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
# Running under: Windows 10 x64 (build 19045)
Alternative Data Sharing Methods
JSON and CSV Formats
For tabular data, CSV or JSON can be more readable:
# CSV format
data_text <- "x,y,z
1,a,0.123
2,b,-0.456
3,c,0.789
4,d,-0.234
5,e,0.567"
# Read from text
df <- read.csv(text = data_text) # text = avoids leaving an open connection behind
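Going the other way, you can let R produce the CSV text from an existing data frame instead of typing it by hand. This is a sketch with a made-up data frame named scores; it prints the text to paste and then checks that reading it back gives the same table.

```r
# Hypothetical data frame to share
scores <- data.frame(id = 1:3, grade = c("a", "b", "c"), points = c(9.5, 7, 8.25))

# Render it as CSV text that can be pasted into a post
csv_lines <- capture.output(write.csv(scores, row.names = FALSE))
cat(csv_lines, sep = "\n")

# A reader rebuilds it straight from the pasted text
rebuilt <- read.csv(text = paste(csv_lines, collapse = "\n"))
all.equal(scores, rebuilt)
```

CSV does not record column types, so factors and dates come back as plain character columns; use dput() when those details matter.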
Custom Functions for Complex Objects
For complex objects, create functions to reconstruct them:
# Instead of pasting a large dput output
create_test_data <- function() {
data.frame(
id = 1:100,
value = rnorm(100),
category = sample(c("A", "B", "C"), 100, replace = TRUE),
stringsAsFactors = FALSE
)
}
# Use in example (set the seed first so every reader gets the same random values)
set.seed(123)
test_data <- create_test_data()
Package Development Practices
For long-term reproducibility, consider package development practices:
# Use usethis for package structure
usethis::create_package("myreproducibleexample")
# Use usethis to set up version control (these helpers are no longer in devtools)
usethis::use_git()
usethis::use_github()
Common Pitfalls and Best Practices
Common Mistakes to Avoid
- Including unnecessary code: Strip out everything not needed to reproduce the issue
- Using absolute file paths: Always use relative paths or create data in code
- Forgetting to set random seeds: Random processes yield different results
- Using local datasets: Either create synthetic data or use built-in datasets
- Missing package dependencies: Always include library() statements
- Masking built-in functions: Avoid naming variables after reserved words or common functions
Best Practices Checklist
- [ ] Run your example from a fresh R session to ensure it works
- [ ] Remove all warnings and messages unless they’re relevant to the issue
- [ ] Test your example on a different machine if possible
- [ ] Include expected vs. actual output
- [ ] Use meaningful variable names
- [ ] Keep it as simple as possible while still reproducing the issue
- [ ] Add comments explaining what each section does
- [ ] Include a clear description of the problem
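The first checklist item, running from a fresh R session, can be approximated from within R by launching a separate process. The sketch below is one possible way to do it, using a two-line placeholder script; the reprex package automates a similar check and formats the result for posting.

```r
# Hypothetical example code, written to a temporary script
example_code <- c(
  "values <- c(10, 20, 15)",
  "print(mean(values))"
)
script_path <- tempfile(fileext = ".R")
writeLines(example_code, script_path)

# Run it in a new R process; --vanilla skips saved workspaces and startup profiles
rscript <- file.path(R.home("bin"), "Rscript")
output <- system2(rscript, c("--vanilla", shQuote(script_path)),
                  stdout = TRUE, stderr = TRUE)
output
```

Objects and attached packages from your current session are not visible to the new process, so anything the example silently relied on shows up as an error in output.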
Performance Considerations
For performance-related questions, include system information:
# System information
Sys.info()
# Benchmark code
library(microbenchmark)
values <- as.numeric(1:1000000) # numeric input avoids integer overflow in sum()
microbenchmark(
  base_approach = sum(values),
  dplyr_approach = dplyr::summarise(data.frame(x = values), total = sum(x))
)
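If you would rather not ask readers to install a benchmarking package, base R's system.time() gives a rougher measurement with no dependencies. A minimal sketch, where the repetition count of 100 is an arbitrary assumption chosen to make the elapsed time measurable:

```r
values <- as.numeric(1:1000000)

# Repeat the operation so the elapsed time is large enough to measure
timing <- system.time(
  for (i in 1:100) total <- sum(values)
)

# Elapsed wall-clock seconds for all repetitions
timing[["elapsed"]]
```

Timings vary between runs and machines, so report them alongside the system information rather than as exact figures.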
Putting It All Together: A Complete Example
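Here’s a template for an excellent reproducible example. The issue it describes is a hypothetical placeholder used to show the layout, not a known dplyr bug, so replace the expected and actual behavior with your own observations: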
# Title: Issue with dplyr group_by and summarise when using factors
# Environment information
R.version.string
# [1] "R version 4.3.1 (2023-06-16)"
packageVersion("dplyr")
# [1] '1.1.3'
# Load required packages
library(dplyr)
# Create reproducible data
set.seed(123) # For reproducibility
test_data <- data.frame(
id = 1:10,
category = sample(c("A", "B", "C"), 10, replace = TRUE),
value = rnorm(10),
stringsAsFactors = FALSE # Modern R default
)
# Expected behavior: Group by category and sum values
# Expected output should have 3 rows (one for each category)
# Actual behavior: Only 2 categories appear in results
# Issue reproduction
result <- test_data %>%
group_by(category) %>%
summarise(total_value = sum(value))
# Show the issue
print("Original data:")
print(test_data)
print("Grouped result:")
print(result)
print("Problem: Category 'C' is missing from results")
# Additional context: This happens specifically when using dplyr >= 1.1.0
# and the data has mixed character/factor columns
This example follows all best practices:
- Clear title and description
- Environment information
- Required packages loaded
- Reproducible data creation
- Expected vs. actual behavior explained
- Minimal code focused on the issue
- Comments explaining each step
Conclusion
Creating an excellent reproducible example in R is both an art and a science that requires attention to detail and consideration for your audience. The key takeaways include:
- Minimize ruthlessly: Include only the code necessary to reproduce your issue, removing all unnecessary complexity and dependencies.
- Represent data properly: Use dput() for small datasets, create synthetic data for larger ones, and always set random seeds when working with random processes.
- Provide context: Include library statements, R session information, and clear descriptions of expected vs. actual behavior.
- Follow naming conventions: Use descriptive variable names and avoid R reserved words to prevent conflicts with built-in functions.
- Test thoroughly: Run your example from a fresh R session to ensure it works as intended before sharing it with others.
By following these practices, you’ll create reproducible examples that help others understand and solve your problems more efficiently, whether you’re reporting bugs, asking for help, or sharing knowledge with colleagues.
Remember that a well-crafted reproducible example not only solves your immediate problem but also serves as a learning resource for the entire R community. The time you invest in creating a clear, reproducible example will be rewarded with faster, more accurate, and more helpful responses from others.