NeuroAgent

Fix stat_regline_equation Mismatch in ggplot2

Resolve the discrepancy between regression equations and lines in ggplot2. Learn about orthogonal vs raw polynomials and implement proper quadratic regression in R with code examples.

Why doesn’t the regression equation from stat_regline_equation match the regression line from stat_smooth in R?

I’m encountering an issue where I’ve created a scatter plot with a quadratic regression line using ggplot2 in R. The regression line is plotted correctly using stat_smooth(), but the equation displayed by stat_regline_equation() doesn’t match the line.

Here’s the code I’m using:

r
#LIBRARIES
library(tidyverse)
library(ggpubr)
library(ggplot2)

test_db <- data.frame(
  X = c(0,-0.1297,-0.2185,-0.2795),
  Y = c(0.7569,0.7396,0.6561,0.5476)
)


test_plot = ggscatter(test_db, x = "X", y = "Y",
                     xlim = c(-1,0),
                     ylim = c(0, 1),
                     fill = "red",
                     shape = 23) +
  stat_regline_equation(label.x = -0.95, label.y = 0.15,  #print equation for regression line
                        formula = y ~ poly(x, 2),         #set regression line as quadratic
                        show.legend=FALSE) +
  stat_smooth(method = "lm",               #plot smoothed conditional mean
              formula = y ~ poly(x, 2),    #set regression line as quadratic
              colour="black", size=0.5,
              fill="red") + 
  theme_bw() + 
  labs(x = "X", 
       y = "Y",
       title = "Test plot")
test_plot

What could be causing this discrepancy between the regression line and the displayed equation?

NeuroAgent

The discrepancy between stat_regline_equation and stat_smooth in your code occurs because the poly() function in R uses orthogonal polynomials by default, which can cause coefficient interpretation issues even though the fitted regression lines should theoretically be identical.


Contents


Understanding the Poly() Function Issue

The root cause of your problem lies in how the poly() function handles polynomial terms. When you use formula = y ~ poly(x, 2) in both functions, R creates orthogonal polynomial terms by default, which can lead to unexpected coefficient interpretations.

According to the ggpubr documentation, stat_regline_equation uses the lm() function internally, but the way poly() is processed can differ between stat_smooth and stat_regline_equation in terms of default parameters.

The Stack Overflow question (R stat_regline_equation doesn’t match stat_smooth line) demonstrates this exact issue and shows that adding raw=TRUE resolves the problem.


Orthogonal vs Raw Polynomials

Let me clarify the key difference:

Orthogonal Polynomials (Default)

r
poly(x, 2)  # Uses orthogonal polynomials by default
  • Creates uncorrelated polynomial terms
  • Better for numerical stability
  • Coefficients represent contributions from orthogonalized terms
  • Harder to interpret directly

Raw Polynomials

r
poly(x, 2, raw = TRUE)  # Uses raw polynomial terms
  • Creates standard x and x² terms
  • Coefficients represent the expected mathematical relationship
  • Easier to interpret and match manual calculations

Important Note: Both approaches should produce the same fitted regression line, but the coefficient values will differ significantly. The issue in your case likely stems from how different ggplot2 layers handle the poly() function.


Differences Between stat_smooth and stat_regline_equation

While both functions use linear models internally, they have different default behaviors:

Function Default Formula Handling Model Fitting Output Focus
stat_smooth() Processes formula directly in ggplot2 context Fits model for visualization Produces smooth line with confidence intervals
stat_regline_equation() Processes formula through ggpubr’s lm wrapper Fits model for equation display Extracts coefficients and R² for text display

The GitHub source code for stat_regline_equation shows that it uses lm() but may have different parameter handling compared to ggplot2’s stat_smooth.


Solutions to Fix the Discrepancy

Solution 1: Use Raw=TRUE (Recommended)

Modify your code to explicitly use raw polynomials:

r
test_plot = ggscatter(test_db, x = "X", y = "Y",
                     xlim = c(-1,0),
                     ylim = c(0, 1),
                     fill = "red",
                     shape = 23) +
  stat_regline_equation(label.x = -0.95, label.y = 0.15,
                        formula = y ~ poly(x, 2, raw = TRUE),  # Use raw=TRUE
                        show.legend=FALSE) +
  stat_smooth(method = "lm",
              formula = y ~ poly(x, 2, raw = TRUE),  # Use raw=TRUE
              colour="black", size=0.5,
              fill="red") + 
  theme_bw() + 
  labs(x = "X", y = "Y", title = "Test plot")

Solution 2: Use Raw Polynomial Terms Directly

An alternative approach is to manually create polynomial terms:

r
test_db <- test_db %>%
  mutate(x2 = X^2)  # Create x² term manually

test_plot = ggscatter(test_db, x = "X", y = "Y",
                     xlim = c(-1,0),
                     ylim = c(0, 1),
                     fill = "red",
                     shape = 23) +
  stat_regline_equation(label.x = -0.95, label.y = 0.15,
                        formula = y ~ X + x2,  # Use raw terms directly
                        show.legend=FALSE) +
  stat_smooth(method = "lm",
              formula = y ~ X + x2,  # Use raw terms directly
              colour="black", size=0.5,
              fill="red") + 
  theme_bw() + 
  labs(x = "X", y = "Y", title = "Test plot")

Solution 3: Verify Model Consistency

You can also fit the models separately to verify they’re identical:

r
# Fit the model manually
manual_model <- lm(Y ~ poly(X, 2, raw = TRUE), data = test_db)
summary(manual_model)

# Compare with what stat_smooth produces
ggplot(test_db, aes(x = X, y = Y)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2, raw = TRUE))

Best Practices for Polynomial Regression in ggplot2

  1. Always specify raw = TRUE when using poly() in regression equations to ensure consistent interpretation

  2. Be cautious with small datasets - With only 4 data points as in your example, quadratic regression may be overfitting

  3. Consider the number of parameters - A quadratic model with 4 data points uses 3 parameters (intercept, linear, quadratic), leaving only 1 degree of freedom

  4. Validate your model - Check R² and p-values to ensure the model is meaningful

  5. Use consistent syntax - Ensure the same formula specification across all ggplot2 layers


Complete Working Example

Here’s a complete, corrected version of your code:

r
#LIBRARIES
library(tidyverse)
library(ggpubr)
library(ggplot2)

# Create test data
test_db <- data.frame(
  X = c(0, -0.1297, -0.2185, -0.2795),
  Y = c(0.7569, 0.7396, 0.6561, 0.5476)
)

# Create plot with consistent raw polynomial specification
test_plot <- ggscatter(test_db, x = "X", y = "Y",
                      xlim = c(-1, 0),
                      ylim = c(0, 1),
                      fill = "red",
                      shape = 23) +
  stat_regline_equation(label.x = -0.95, label.y = 0.15,
                        formula = y ~ poly(x, 2, raw = TRUE),  # Key fix: raw = TRUE
                        show.legend = FALSE) +
  stat_smooth(method = "lm",
              formula = y ~ poly(x, 2, raw = TRUE),  # Consistent specification
              colour = "black", size = 0.5,
              fill = "red") + 
  theme_bw() + 
  labs(x = "X", y = "Y", title = "Test plot with Quadratic Regression")

# Display the plot
test_plot

This code will now display both the regression line and equation that match perfectly, as both functions use the same raw polynomial specification.


Conclusion

The discrepancy between stat_regline_equation and stat_smooth in your ggplot2 code was caused by the default behavior of the poly() function using orthogonal polynomials. By explicitly setting raw = TRUE, you ensure both functions use the same polynomial basis, resulting in consistent regression lines and equations.

Key takeaways:

  1. Always specify raw = TRUE when using poly() in regression equations
  2. Ensure consistent formula specifications across all ggplot2 layers
  3. Be aware of the mathematical differences between orthogonal and raw polynomials
  4. Consider the appropriateness of polynomial regression for your dataset size

This solution addresses the immediate issue while following best practices for statistical visualization in R.


Sources

  1. ggpubr stat_regline_equation documentation
  2. Stack Overflow: stat_regline_equation doesn’t match stat_smooth line
  3. ggpubr GitHub source code - stat_regline_equation.R
  4. RDocumentation: stat_regline_equation function
  5. Quantargo: ggpubr stat_regline_equation