How to properly implement Horn’s Parallel Analysis to determine the number of principal components in PCA based on a covariance matrix?
I’m attempting to perform a parallel analysis (Horn, 1965) to determine the number of meaningful principal components in a PCA based on a covariance matrix (not a correlation matrix). I’ve tried using the nFactors package but haven’t been able to interpret or implement it correctly. When I ran the analysis, all the empirical eigenvalues were lower than the simulated ones, which seems incorrect.
What troubleshooting steps should I try next? Here’s my current implementation:
# Code for performing Horn's Parallel Analysis for PCA on Covariance Matrix
# Load necessary libraries ####
library(openxlsx)
library(nFactors)
# Importing the dataset
data <- read.xlsx("data.xlsx")
# Number of observations and variables
n_subjects <- nrow(data)
n_variables <- ncol(data)
# Calculate the covariance matrix
cov_mat <- cov(data, use = "pairwise.complete.obs")
cov_mat
# Calculate the standard deviation of variables
sd_vec <- sqrt(diag(cov_mat))
sd_vector <- as.vector(sd_vec)
# Calculating the eigenvalues
eigen_cov <- eigen(cov_mat)$values
eigen_cov <- as.vector(eigen_cov)
# Performing the parallel analysis
# Quantile of the eigenvalues distribution
sim_qevpea <- parallel(
var = n_variables,
subject = n_subjects,
rep = 1000,
quantile = 0.95,
sd = diag(sd_vector))$eigen$qevpea
# Mean of the eigenvalues distribution
sim_mevpea <- parallel(
var = n_variables,
subject = n_subjects,
rep = 1000,
quantil = 0.95,
sd = diag(sd_vector))$eigen$mevpea
# Merging results into a dataframe
sim_mevpea <- as.data.frame(sim_mevpea)
sim_qevpea <- as.data.frame(sim_qevpea)
obs_eigenvalues <- as.data.frame(eigen_cov)
sim_eig_obs_eig <- cbind(sim_mevpea, sim_qevpea, obs_eigenvalues)
# Results
sim_mevpea sim_qevpea eigen_empiric
1 0.016609276 0.018170467 2.755510e-04
2 0.014469410 0.015579270 2.351656e-04
3 0.012966035 0.013853537 1.452651e-04
4 0.011738829 0.012546877 1.289419e-04
5 0.010746659 0.011467116 1.105977e-04
6 0.009892356 0.010519131 8.850754e-05
7 0.009101438 0.009717212 6.505378e-05
8 0.008403761 0.008964388 5.331106e-05
9 0.007787579 0.008299819 4.988057e-05
10 0.007208755 0.007668394 3.628711e-05
11 0.006675843 0.007149570 2.814191e-05
12 0.006194959 0.006593400 2.418505e-05
13 0.005733585 0.006138405 1.627334e-05
14 0.005284943 0.005671669 1.355602e-05
15 0.004858114 0.005257232 1.123874e-05
16 0.004453399 0.004805642 9.359437e-06
17 0.004045374 0.004410657 7.997094e-06
18 0.003625045 0.004021005 7.779925e-06
19 0.003190694 0.003575597 1.775028e-06
20 0.002681099 0.003138437 6.740691e-07
Is there an issue with my implementation? Are there alternative approaches or corrections I should consider?
Horn’s Parallel Analysis for PCA on covariance matrices requires careful implementation due to the fundamental differences between covariance and correlation matrices. The issue with your implementation is likely related to how eigenvalues are calculated and compared between empirical and simulated data when working with covariance matrices.
When working with covariance matrices, eigenvalues can be significantly larger than those from correlation matrices, and the simulation approach must account for the actual scale and variance structure of your data. The extremely small eigenvalues you’re observing suggest a scaling issue in your implementation.
Contents
- Understanding the Problem with Your Implementation
- Correct Implementation Steps
- Alternative Approaches
- Troubleshooting Common Issues
- Complete R Code Example
- Interpreting Results
- Sources
Understanding the Problem with Your Implementation
Your current implementation has several issues that explain why all empirical eigenvalues are lower than simulated ones:
- Parameter Typo: In your second
parallel()call, you usedquantilinstead ofquantile - Eigenvalue Scale Issue: The eigenvalues from covariance matrices are typically much larger than those from correlation matrices, but your results show extremely small values
- Standard Deviation Specification: The
sdparameter needs to be properly specified for covariance matrix analysis - Data Scaling: The eigenvalues suggest your data might be on a very small scale or there’s a normalization issue
Correct Implementation Steps
Here’s how to properly implement Horn’s Parallel Analysis for covariance matrices:
1. Prepare Your Data Correctly
# Load necessary libraries
library(nFactors)
library(psych)
# Import your data
data <- read.xlsx("data.xlsx")
# Ensure your data is properly scaled
# Important: Do NOT standardize the data for covariance matrix PCA
n_subjects <- nrow(data)
n_variables <- ncol(data)
# Calculate the covariance matrix
cov_mat <- cov(data, use = "pairwise.complete.obs")
# Calculate eigenvalues from your actual data
eigen_observed <- eigen(cov_mat)$values
2. Implement Parallel Analysis with Correct Parameters
# Perform parallel analysis for covariance matrix
results <- parallel(
var = n_variables,
subject = n_subjects,
rep = 1000, # Number of replications
quantile = 0.95, # 95th percentile cutoff
model = "components", # PCA model
sd = sqrt(diag(cov_mat)) # Critical: use actual standard deviations
)
# Extract simulated eigenvalues
simulated_eigen_mean <- results$eigen$mevpea
simulated_eigen_quantile <- results$eigen$qevpea
3. Compare and Determine Component Retention
# Create comparison dataframe
comparison <- data.frame(
Component = 1:n_variables,
Observed = eigen_observed,
Simulated_Mean = simulated_eigen_mean,
Simulated_95th = simulated_eigen_quantile,
Retain = eigen_observed > simulated_eigen_quantile
)
# View results
print(comparison)
Alternative Approaches
Using the paran Package
The paran package is specifically designed for Horn’s Parallel Analysis:
library(paran)
# Perform parallel analysis
pa_result <- paran(
cov_mat,
n.iter = 1000,
cent = 0.05,
quantile = TRUE
)
# Plot results
plot(pa_result)
# Get suggested number of components
pa_result$nfact
Manual Implementation for Better Control
# Manual parallel implementation
set.seed(123) # For reproducibility
n_sim <- 1000
sim_eigen <- matrix(NA, nrow = n_variables, ncol = n_sim)
for (i in 1:n_sim) {
# Generate random data with same covariance structure
random_data <- matrix(rnorm(n_subjects * n_variables),
nrow = n_subjects, ncol = n_variables)
# Scale to match original variances (critical for covariance matrix)
random_data_scaled <- sweep(random_data, 2,
sqrt(diag(cov_mat)), `*`)
# Calculate covariance matrix and eigenvalues
cov_sim <- cov(random_data_scaled)
sim_eigen[, i] <- eigen(cov_sim)$values
}
# Calculate mean and quantiles
sim_mean <- apply(sim_eigen, 1, mean)
sim_95th <- apply(sim_eigen, 1, quantile, probs = 0.95)
# Compare with observed eigenvalues
comparison <- data.frame(
Component = 1:n_variables,
Observed = eigen_observed,
Simulated_Mean = sim_mean,
Simulated_95th = sim_95th,
Retain = eigen_observed > sim_95th
)
Troubleshooting Common Issues
Issue 1: All Eigenvalues Are Extremely Small
Problem: Your eigenvalues are in the range of 10^-6 to 10^-3, which is unusually small.
Solution:
- Check the scale of your original data
- Verify that you’re not standardizing the data before covariance calculation
- Ensure your data isn’t already normalized or scaled
Issue 2: Empirical Eigenvalues Always Lower Than Simulated
Problem: This indicates a fundamental issue with your approach.
Solutions:
- Fix the parameter typo: Change
quantiltoquantile - Verify standard deviation specification: Use
sd = sqrt(diag(cov_mat)) - Check data integrity: Ensure your covariance matrix is correctly calculated
- Increase replications: Try
rep = 2000for more stable estimates
Issue 3: Inconsistent Results Across Runs
Problem: Parallel analysis results vary significantly between runs.
Solution:
- Set a seed for reproducibility:
set.seed(123) - Increase the number of replications:
rep = 2000 - Use a higher quantile:
quantile = 0.99
Complete R Code Example
# Complete Horn's Parallel Analysis for Covariance Matrix PCA
# Load libraries
library(nFactors)
library(psych)
library(openxlsx)
# Set seed for reproducibility
set.seed(123)
# Import data
data <- read.xlsx("data.xlsx")
# Basic information about the data
cat("Number of observations:", nrow(data), "\n")
cat("Number of variables:", ncol(data), "\n")
# Calculate covariance matrix (do NOT standardize for covariance PCA)
cov_mat <- cov(data, use = "pairwise.complete.obs")
# Calculate observed eigenvalues
eigen_observed <- eigen(cov_mat)$values
# Perform parallel analysis
results <- parallel(
var = ncol(data),
subject = nrow(data),
rep = 2000, # Increased replications for stability
quantile = 0.95, # 95th percentile
model = "components",
sd = sqrt(diag(cov_mat)) # Critical parameter for covariance matrices
)
# Extract simulated eigenvalues
sim_mean <- results$eigen$mevpea
sim_95th <- results$eigen$qevpea
# Create comparison dataframe
comparison <- data.frame(
Component = 1:length(eigen_observed),
Observed_Eigenvalue = eigen_observed,
Simulated_Mean = sim_mean,
Simulated_95th = sim_95th,
Difference = eigen_observed - sim_95th,
Retain_Component = eigen_observed > sim_95th
)
# Display results
print("Eigenvalue Comparison:")
print(comparison)
# Determine number of components to retain
n_components <- sum(eigen_observed > sim_95th)
cat("\nRecommended number of components to retain:", n_components, "\n")
# Create visualization
plot(1:length(eigen_observed), eigen_observed,
type = "b", pch = 19, col = "blue",
xlab = "Component Number",
ylab = "Eigenvalue",
main = "Horn's Parallel Analysis - Covariance Matrix")
lines(1:length(eigen_observed), sim_mean,
type = "b", pch = 17, col = "red")
lines(1:length(eigen_observed), sim_95th,
type = "b", pch = 15, col = "green")
legend("topright",
legend = c("Observed", "Simulated Mean", "Simulated 95th"),
col = c("blue", "red", "green"),
pch = c(19, 17, 15))
# Additional diagnostic plots
par(mfrow = c(2, 2))
plotuScree(eigen_observed, main = "Scree Plot")
plotuScree(sim_95th, main = "Simulated 95th Percentile")
hist(eigen_observed, main = "Distribution of Observed Eigenvalues")
hist(sim_95th, main = "Distribution of Simulated 95th Percentile")
par(mfrow = c(1, 1))
Interpreting Results
When interpreting your parallel analysis results:
- Retention Rule: Retain components where observed eigenvalues exceed the simulated 95th percentile eigenvalues
- Component Ordering: Components are ordered by eigenvalue magnitude (largest to smallest)
- Decision Making: The first component where observed eigenvalue drops below the simulated threshold indicates the optimal number of components
Example Interpretation:
Component 1: Observed = 5.2, Simulated_95th = 1.8 → RETAIN
Component 2: Observed = 2.1, Simulated_95th = 1.5 → RETAIN
Component 3: Observed = 1.2, Simulated_95th = 1.3 → DO NOT RETAIN
In this example, you would retain 2 components.
The key insight from Horn’s original work is that parallel analysis corrects for the sample-size inflation of eigenvalues that occurs with the Kaiser criterion, providing a more accurate determination of the number of meaningful components in PCA.