NeuroAgent

Implementing Horn's Parallel Analysis for Covariance PCA

Complete guide to implementing Horn's Parallel Analysis for PCA on covariance matrices in R. Troubleshoot eigenvalue issues and determine optimal component retention with proper parameter settings.

#data-analysis #eigenvalue-analysis #horn-parallel-analysis #pca-covariance #r-programming #statistical-methods

11/07/2025, 04:42 PM

How to properly implement Horn’s Parallel Analysis to determine the number of principal components in PCA based on a covariance matrix?

I’m attempting to perform a parallel analysis (Horn, 1965) to determine the number of meaningful principal components in a PCA based on a covariance matrix (not a correlation matrix). I’ve tried using the nFactors package but haven’t been able to interpret or implement it correctly. When I ran the analysis, all the empirical eigenvalues were lower than the simulated ones, which seems incorrect.

What troubleshooting steps should I try next? Here’s my current implementation:

# Code for performing Horn's Parallel Analysis for PCA on Covariance Matrix

# Load necessary libraries ####
library(openxlsx)
library(nFactors)

# Importing the dataset
data <- read.xlsx("data.xlsx")

# Number of observations and variables
n_subjects <- nrow(data)
n_variables <- ncol(data)

# Calculate the covariance matrix
cov_mat <- cov(data, use = "pairwise.complete.obs")
cov_mat

# Calculate the standard deviation of variables
sd_vec <- sqrt(diag(cov_mat))
sd_vector <- as.vector(sd_vec)

# Calculating the eigenvalues
eigen_cov <- eigen(cov_mat)$values
eigen_cov <- as.vector(eigen_cov)

# Performing the parallel analysis

# Quantile of the eigenvalues distribution 
sim_qevpea <- parallel(
  var = n_variables,
  subject = n_subjects,
  rep = 1000,      
  quantile = 0.95, 
  sd = diag(sd_vector))$eigen$qevpea

# Mean of the eigenvalues distribution
sim_mevpea <- parallel(
  var = n_variables,
  subject = n_subjects,
  rep = 1000,      
  quantil = 0.95, 
  sd = diag(sd_vector))$eigen$mevpea

# Merging results into a dataframe
sim_mevpea <- as.data.frame(sim_mevpea)
sim_qevpea <- as.data.frame(sim_qevpea)
obs_eigenvalues <- as.data.frame(eigen_cov)
sim_eig_obs_eig <- cbind(sim_mevpea, sim_qevpea, obs_eigenvalues)

# Results
    sim_mevpea  sim_qevpea    eigen_empiric
1  0.016609276 0.018170467 2.755510e-04
2  0.014469410 0.015579270 2.351656e-04
3  0.012966035 0.013853537 1.452651e-04
4  0.011738829 0.012546877 1.289419e-04
5  0.010746659 0.011467116 1.105977e-04
6  0.009892356 0.010519131 8.850754e-05
7  0.009101438 0.009717212 6.505378e-05
8  0.008403761 0.008964388 5.331106e-05
9  0.007787579 0.008299819 4.988057e-05
10 0.007208755 0.007668394 3.628711e-05
11 0.006675843 0.007149570 2.814191e-05
12 0.006194959 0.006593400 2.418505e-05
13 0.005733585 0.006138405 1.627334e-05
14 0.005284943 0.005671669 1.355602e-05
15 0.004858114 0.005257232 1.123874e-05
16 0.004453399 0.004805642 9.359437e-06
17 0.004045374 0.004410657 7.997094e-06
18 0.003625045 0.004021005 7.779925e-06
19 0.003190694 0.003575597 1.775028e-06
20 0.002681099 0.003138437 6.740691e-07

Is there an issue with my implementation? Are there alternative approaches or corrections I should consider?

NeuroAgent

Horn’s Parallel Analysis for PCA on covariance matrices requires careful implementation due to the fundamental differences between covariance and correlation matrices. The issue with your implementation is likely related to how eigenvalues are calculated and compared between empirical and simulated data when working with covariance matrices.

When working with covariance matrices, eigenvalues can be significantly larger than those from correlation matrices, and the simulation approach must account for the actual scale and variance structure of your data. The extremely small eigenvalues you’re observing suggest a scaling issue in your implementation.

Understanding the Problem with Your Implementation
Correct Implementation Steps
Alternative Approaches
Troubleshooting Common Issues
Complete R Code Example
Interpreting Results
Sources

Understanding the Problem with Your Implementation

Your current implementation has several issues that explain why all empirical eigenvalues are lower than simulated ones:

Parameter Typo: In your second parallel() call, you used quantil instead of quantile
Eigenvalue Scale Issue: The eigenvalues from covariance matrices are typically much larger than those from correlation matrices, but your results show extremely small values
Standard Deviation Specification: The sd parameter needs to be properly specified for covariance matrix analysis
Data Scaling: The eigenvalues suggest your data might be on a very small scale or there’s a normalization issue

Correct Implementation Steps

Here’s how to properly implement Horn’s Parallel Analysis for covariance matrices:

1. Prepare Your Data Correctly

# Load necessary libraries
library(nFactors)
library(psych)

# Import your data
data <- read.xlsx("data.xlsx")

# Ensure your data is properly scaled
# Important: Do NOT standardize the data for covariance matrix PCA
n_subjects <- nrow(data)
n_variables <- ncol(data)

# Calculate the covariance matrix
cov_mat <- cov(data, use = "pairwise.complete.obs")

# Calculate eigenvalues from your actual data
eigen_observed <- eigen(cov_mat)$values

2. Implement Parallel Analysis with Correct Parameters

# Perform parallel analysis for covariance matrix
results <- parallel(
  var = n_variables,
  subject = n_subjects,
  rep = 1000,          # Number of replications
  quantile = 0.95,     # 95th percentile cutoff
  model = "components", # PCA model
  sd = sqrt(diag(cov_mat)) # Critical: use actual standard deviations
)

# Extract simulated eigenvalues
simulated_eigen_mean <- results$eigen$mevpea
simulated_eigen_quantile <- results$eigen$qevpea

3. Compare and Determine Component Retention

# Create comparison dataframe
comparison <- data.frame(
  Component = 1:n_variables,
  Observed = eigen_observed,
  Simulated_Mean = simulated_eigen_mean,
  Simulated_95th = simulated_eigen_quantile,
  Retain = eigen_observed > simulated_eigen_quantile
)

# View results
print(comparison)

Alternative Approaches

Using the paran Package

The paran package is specifically designed for Horn’s Parallel Analysis:

library(paran)

# Perform parallel analysis
pa_result <- paran(
  cov_mat,
  n.iter = 1000,
  cent = 0.05,
  quantile = TRUE
)

# Plot results
plot(pa_result)

# Get suggested number of components
pa_result$nfact

Manual Implementation for Better Control

# Manual parallel implementation
set.seed(123)  # For reproducibility
n_sim <- 1000
sim_eigen <- matrix(NA, nrow = n_variables, ncol = n_sim)

for (i in 1:n_sim) {
  # Generate random data with same covariance structure
  random_data <- matrix(rnorm(n_subjects * n_variables), 
                       nrow = n_subjects, ncol = n_variables)
  
  # Scale to match original variances (critical for covariance matrix)
  random_data_scaled <- sweep(random_data, 2, 
                             sqrt(diag(cov_mat)), `*`)
  
  # Calculate covariance matrix and eigenvalues
  cov_sim <- cov(random_data_scaled)
  sim_eigen[, i] <- eigen(cov_sim)$values
}

# Calculate mean and quantiles
sim_mean <- apply(sim_eigen, 1, mean)
sim_95th <- apply(sim_eigen, 1, quantile, probs = 0.95)

# Compare with observed eigenvalues
comparison <- data.frame(
  Component = 1:n_variables,
  Observed = eigen_observed,
  Simulated_Mean = sim_mean,
  Simulated_95th = sim_95th,
  Retain = eigen_observed > sim_95th
)

Troubleshooting Common Issues

Issue 1: All Eigenvalues Are Extremely Small

Problem: Your eigenvalues are in the range of 10^-6 to 10^-3, which is unusually small.

Solution:

Check the scale of your original data
Verify that you’re not standardizing the data before covariance calculation
Ensure your data isn’t already normalized or scaled

Issue 2: Empirical Eigenvalues Always Lower Than Simulated

Problem: This indicates a fundamental issue with your approach.

Solutions:

Fix the parameter typo: Change quantil to quantile
Verify standard deviation specification: Use sd = sqrt(diag(cov_mat))
Check data integrity: Ensure your covariance matrix is correctly calculated
Increase replications: Try rep = 2000 for more stable estimates

Issue 3: Inconsistent Results Across Runs

Problem: Parallel analysis results vary significantly between runs.

Solution:

Set a seed for reproducibility: set.seed(123)
Increase the number of replications: rep = 2000
Use a higher quantile: quantile = 0.99

Complete R Code Example

# Complete Horn's Parallel Analysis for Covariance Matrix PCA

# Load libraries
library(nFactors)
library(psych)
library(openxlsx)

# Set seed for reproducibility
set.seed(123)

# Import data
data <- read.xlsx("data.xlsx")

# Basic information about the data
cat("Number of observations:", nrow(data), "\n")
cat("Number of variables:", ncol(data), "\n")

# Calculate covariance matrix (do NOT standardize for covariance PCA)
cov_mat <- cov(data, use = "pairwise.complete.obs")

# Calculate observed eigenvalues
eigen_observed <- eigen(cov_mat)$values

# Perform parallel analysis
results <- parallel(
  var = ncol(data),
  subject = nrow(data),
  rep = 2000,          # Increased replications for stability
  quantile = 0.95,     # 95th percentile
  model = "components",
  sd = sqrt(diag(cov_mat))  # Critical parameter for covariance matrices
)

# Extract simulated eigenvalues
sim_mean <- results$eigen$mevpea
sim_95th <- results$eigen$qevpea

# Create comparison dataframe
comparison <- data.frame(
  Component = 1:length(eigen_observed),
  Observed_Eigenvalue = eigen_observed,
  Simulated_Mean = sim_mean,
  Simulated_95th = sim_95th,
  Difference = eigen_observed - sim_95th,
  Retain_Component = eigen_observed > sim_95th
)

# Display results
print("Eigenvalue Comparison:")
print(comparison)

# Determine number of components to retain
n_components <- sum(eigen_observed > sim_95th)
cat("\nRecommended number of components to retain:", n_components, "\n")

# Create visualization
plot(1:length(eigen_observed), eigen_observed, 
     type = "b", pch = 19, col = "blue",
     xlab = "Component Number", 
     ylab = "Eigenvalue",
     main = "Horn's Parallel Analysis - Covariance Matrix")
lines(1:length(eigen_observed), sim_mean, 
      type = "b", pch = 17, col = "red")
lines(1:length(eigen_observed), sim_95th, 
      type = "b", pch = 15, col = "green")
legend("topright", 
       legend = c("Observed", "Simulated Mean", "Simulated 95th"),
       col = c("blue", "red", "green"),
       pch = c(19, 17, 15))

# Additional diagnostic plots
par(mfrow = c(2, 2))
plotuScree(eigen_observed, main = "Scree Plot")
plotuScree(sim_95th, main = "Simulated 95th Percentile")
hist(eigen_observed, main = "Distribution of Observed Eigenvalues")
hist(sim_95th, main = "Distribution of Simulated 95th Percentile")
par(mfrow = c(1, 1))

Interpreting Results

When interpreting your parallel analysis results:

Retention Rule: Retain components where observed eigenvalues exceed the simulated 95th percentile eigenvalues
Component Ordering: Components are ordered by eigenvalue magnitude (largest to smallest)
Decision Making: The first component where observed eigenvalue drops below the simulated threshold indicates the optimal number of components

Example Interpretation:

Component 1: Observed = 5.2, Simulated_95th = 1.8 → RETAIN
Component 2: Observed = 2.1, Simulated_95th = 1.5 → RETAIN
Component 3: Observed = 1.2, Simulated_95th = 1.3 → DO NOT RETAIN

In this example, you would retain 2 components.

The key insight from Horn’s original work is that parallel analysis corrects for the sample-size inflation of eigenvalues that occurs with the Kaiser criterion, providing a more accurate determination of the number of meaningful components in PCA.

Sources

How to interpret Horn's Parallel Analysis results for different quantile thresholds?What are the differences between parallel analysis for correlation vs covariance matrices?How to implement parallel analysis in Python for PCA component selection?What sample size requirements exist for reliable Horn's Parallel Analysis results?How to handle missing data in Horn's Parallel Analysis implementation?What are the advantages of Horn's method over Kaiser criterion for component retention?

Ask NeuroAgent