Why doesn’t the regression equation from stat_regline_equation match the regression line from stat_smooth in R?
I’m encountering an issue where I’ve created a scatter plot with a quadratic regression line using ggplot2 in R. The regression line is plotted correctly using stat_smooth(), but the equation displayed by stat_regline_equation() doesn’t match the line.
Here’s the code I’m using:
#LIBRARIES
library(tidyverse)
library(ggpubr)
library(ggplot2)
test_db <- data.frame(
X = c(0,-0.1297,-0.2185,-0.2795),
Y = c(0.7569,0.7396,0.6561,0.5476)
)
test_plot = ggscatter(test_db, x = "X", y = "Y",
xlim = c(-1,0),
ylim = c(0, 1),
fill = "red",
shape = 23) +
stat_regline_equation(label.x = -0.95, label.y = 0.15, #print equation for regression line
formula = y ~ poly(x, 2), #set regression line as quadratic
show.legend=FALSE) +
stat_smooth(method = "lm", #plot smoothed conditional mean
formula = y ~ poly(x, 2), #set regression line as quadratic
colour="black", size=0.5,
fill="red") +
theme_bw() +
labs(x = "X",
y = "Y",
title = "Test plot")
test_plot
What could be causing this discrepancy between the regression line and the displayed equation?
The discrepancy between stat_regline_equation and stat_smooth in your code occurs because the poly() function in R uses orthogonal polynomials by default, which can cause coefficient interpretation issues even though the fitted regression lines should theoretically be identical.
Contents
- Understanding the Poly() Function Issue
- Orthogonal vs Raw Polynomials
- Differences Between stat_smooth and stat_regline_equation
- Solutions to Fix the Discrepancy
- Best Practices for Polynomial Regression in ggplot2
- Complete Working Example
Understanding the Poly() Function Issue
The root cause of your problem lies in how the poly() function handles polynomial terms. When you use formula = y ~ poly(x, 2) in both functions, R creates orthogonal polynomial terms by default, which can lead to unexpected coefficient interpretations.
According to the ggpubr documentation, stat_regline_equation uses the lm() function internally, but the way poly() is processed can differ between stat_smooth and stat_regline_equation in terms of default parameters.
The Stack Overflow question (R stat_regline_equation doesn’t match stat_smooth line) demonstrates this exact issue and shows that adding raw=TRUE resolves the problem.
Orthogonal vs Raw Polynomials
Let me clarify the key difference:
Orthogonal Polynomials (Default)
poly(x, 2) # Uses orthogonal polynomials by default
- Creates uncorrelated polynomial terms
- Better for numerical stability
- Coefficients represent contributions from orthogonalized terms
- Harder to interpret directly
Raw Polynomials
poly(x, 2, raw = TRUE) # Uses raw polynomial terms
- Creates standard x and x² terms
- Coefficients represent the expected mathematical relationship
- Easier to interpret and match manual calculations
Important Note: Both approaches should produce the same fitted regression line, but the coefficient values will differ significantly. The issue in your case likely stems from how different ggplot2 layers handle the
poly()function.
Differences Between stat_smooth and stat_regline_equation
While both functions use linear models internally, they have different default behaviors:
| Function | Default Formula Handling | Model Fitting | Output Focus |
|---|---|---|---|
stat_smooth() |
Processes formula directly in ggplot2 context | Fits model for visualization | Produces smooth line with confidence intervals |
stat_regline_equation() |
Processes formula through ggpubr’s lm wrapper | Fits model for equation display | Extracts coefficients and R² for text display |
The GitHub source code for stat_regline_equation shows that it uses lm() but may have different parameter handling compared to ggplot2’s stat_smooth.
Solutions to Fix the Discrepancy
Solution 1: Use Raw=TRUE (Recommended)
Modify your code to explicitly use raw polynomials:
test_plot = ggscatter(test_db, x = "X", y = "Y",
xlim = c(-1,0),
ylim = c(0, 1),
fill = "red",
shape = 23) +
stat_regline_equation(label.x = -0.95, label.y = 0.15,
formula = y ~ poly(x, 2, raw = TRUE), # Use raw=TRUE
show.legend=FALSE) +
stat_smooth(method = "lm",
formula = y ~ poly(x, 2, raw = TRUE), # Use raw=TRUE
colour="black", size=0.5,
fill="red") +
theme_bw() +
labs(x = "X", y = "Y", title = "Test plot")
Solution 2: Use Raw Polynomial Terms Directly
An alternative approach is to manually create polynomial terms:
test_db <- test_db %>%
mutate(x2 = X^2) # Create x² term manually
test_plot = ggscatter(test_db, x = "X", y = "Y",
xlim = c(-1,0),
ylim = c(0, 1),
fill = "red",
shape = 23) +
stat_regline_equation(label.x = -0.95, label.y = 0.15,
formula = y ~ X + x2, # Use raw terms directly
show.legend=FALSE) +
stat_smooth(method = "lm",
formula = y ~ X + x2, # Use raw terms directly
colour="black", size=0.5,
fill="red") +
theme_bw() +
labs(x = "X", y = "Y", title = "Test plot")
Solution 3: Verify Model Consistency
You can also fit the models separately to verify they’re identical:
# Fit the model manually
manual_model <- lm(Y ~ poly(X, 2, raw = TRUE), data = test_db)
summary(manual_model)
# Compare with what stat_smooth produces
ggplot(test_db, aes(x = X, y = Y)) +
geom_point() +
stat_smooth(method = "lm", formula = y ~ poly(x, 2, raw = TRUE))
Best Practices for Polynomial Regression in ggplot2
-
Always specify raw = TRUE when using
poly()in regression equations to ensure consistent interpretation -
Be cautious with small datasets - With only 4 data points as in your example, quadratic regression may be overfitting
-
Consider the number of parameters - A quadratic model with 4 data points uses 3 parameters (intercept, linear, quadratic), leaving only 1 degree of freedom
-
Validate your model - Check R² and p-values to ensure the model is meaningful
-
Use consistent syntax - Ensure the same formula specification across all ggplot2 layers
Complete Working Example
Here’s a complete, corrected version of your code:
#LIBRARIES
library(tidyverse)
library(ggpubr)
library(ggplot2)
# Create test data
test_db <- data.frame(
X = c(0, -0.1297, -0.2185, -0.2795),
Y = c(0.7569, 0.7396, 0.6561, 0.5476)
)
# Create plot with consistent raw polynomial specification
test_plot <- ggscatter(test_db, x = "X", y = "Y",
xlim = c(-1, 0),
ylim = c(0, 1),
fill = "red",
shape = 23) +
stat_regline_equation(label.x = -0.95, label.y = 0.15,
formula = y ~ poly(x, 2, raw = TRUE), # Key fix: raw = TRUE
show.legend = FALSE) +
stat_smooth(method = "lm",
formula = y ~ poly(x, 2, raw = TRUE), # Consistent specification
colour = "black", size = 0.5,
fill = "red") +
theme_bw() +
labs(x = "X", y = "Y", title = "Test plot with Quadratic Regression")
# Display the plot
test_plot
This code will now display both the regression line and equation that match perfectly, as both functions use the same raw polynomial specification.
Conclusion
The discrepancy between stat_regline_equation and stat_smooth in your ggplot2 code was caused by the default behavior of the poly() function using orthogonal polynomials. By explicitly setting raw = TRUE, you ensure both functions use the same polynomial basis, resulting in consistent regression lines and equations.
Key takeaways:
- Always specify
raw = TRUEwhen usingpoly()in regression equations - Ensure consistent formula specifications across all ggplot2 layers
- Be aware of the mathematical differences between orthogonal and raw polynomials
- Consider the appropriateness of polynomial regression for your dataset size
This solution addresses the immediate issue while following best practices for statistical visualization in R.