Databases

Why PostgreSQL INSERT Slow on Empty Tables with FKs

PostgreSQL INSERTs slow on empty tables with foreign keys due to planner stats (reltuples=0) causing seq scans. Seeding 1 row or ON CONFLICT fixes it. Best practices for bulk inserts, ANALYZE, libpq sync.


Why are INSERTs slow into a freshly created PostgreSQL table with foreign keys?

I’m using PostgreSQL 16 on Ubuntu 24.04 and running an INSERT query via libpq:

sql
INSERT INTO table_name (id, field1, field2...) 
 SELECT id, field1, field2... 
 FROM temp_sync_table_name src 
 WHERE NOT EXISTS (SELECT 1 FROM table_name dest WHERE dest.id = src.id);

The destination table has foreign key constraints referencing already-populated tables.

For ~500,000 rows:

  • If the table is empty (freshly created), the query is very slow (>5 minutes, killed it).
  • If I first INSERT 1 row, then run the 500k row INSERT, it completes in ~5 seconds.

This only affects large tables (350k+ rows); smaller ones (20k-30k) are fine.

Someone suggested running ANALYZE table_name beforehand to update planner statistics. Is this a recommended solution?

Context: This is part of a database sync process between two production DBs:

  1. Copy tables to temp tables (no constraints, multithreaded via PQputCopyData).
  2. Sync to final tables (UPDATE/INSERT/DELETE in dependency order, single-threaded transaction, pipeline mode).

The NOT EXISTS avoids duplicates (avoids ON CONFLICT sequence advancement issues).

Update: Switching to ON CONFLICT (id) DO NOTHING resolved it, even on empty tables. No lock issues found.

Core question: What causes this performance difference on empty vs. seeded tables, and best practices for bulk inserts with FKs?

PostgreSQL INSERTs into freshly created tables with foreign keys are slow because the query planner defaults to sequential scans when the table’s statistics show no rows (reltuples=0), producing inefficient execution plans for large bulk operations. The planner assumes an empty table is cheap to scan sequentially, so duplicate checks and foreign key validation end up in slow nested-loop plans; performance improves dramatically once even a single row is seeded, or when alternative syntax like ON CONFLICT is used.


Why PostgreSQL INSERTs Are Slow on Empty Tables with Foreign Keys

The performance degradation you’re experiencing with INSERT operations into freshly created PostgreSQL tables with foreign keys is a well-documented phenomenon that stems from how PostgreSQL’s query planner handles statistics for empty tables. When a table is newly created and empty, PostgreSQL’s planner statistics indicate zero rows (reltuples=0), which causes the planner to make assumptions about the most efficient execution plan.

Your specific query uses a NOT EXISTS subquery pattern:

sql
INSERT INTO table_name (id, field1, field2...) 
SELECT id, field1, field2... 
FROM temp_sync_table_name src 
WHERE NOT EXISTS (SELECT 1 FROM table_name dest WHERE dest.id = src.id);

When the destination table is empty, the planner sees no statistics and defaults to sequential scans rather than index scans. This leads to the nested loop execution plan becoming inefficient, especially with foreign key constraints that require validation against other tables.

According to the PostgreSQL community, this issue becomes particularly pronounced with large datasets (350k+ rows in your case) and foreign key relationships. The combination of empty table statistics, foreign key validation requirements, and the NOT EXISTS subquery creates a perfect storm for poor performance.

As noted in the PostgreSQL mailing list discussions: “Empty tables cause the planner to choose seq scans which triggers nested loop joins to check foreign keys, making inserts incredibly slow.”

The Role of Query Planner Statistics in PostgreSQL

PostgreSQL’s query planner relies on statistical information about tables and indexes to determine the most efficient execution plan for queries. These statistics are stored in system catalogs like pg_class (which contains reltuples - the estimated number of rows) and pg_stats (which contains detailed column statistics).
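You can inspect both of these planner inputs directly. A quick sketch, using the question’s table_name placeholder (substitute your own table):

sql
-- Row and page estimates the planner starts from
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'table_name';

-- Per-column statistics gathered by ANALYZE
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'table_name';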

When a table is freshly created and empty, reltuples is set to 0. The PostgreSQL planner intentionally overestimates the size of empty tables to avoid certain inefficiencies, but this can still lead to suboptimal plans for specific query patterns. The code in plancat.c demonstrates this intentional overestimation:

c
/*
 * Paraphrased from estimate_rel_size() in plancat.c: a table that
 * has never been vacuumed (relpages = 0) is assumed to occupy at
 * least 10 pages, and the row count is then derived from an assumed
 * tuple density -- on the order of 2550 tuples for default-width rows.
 */
if (curpages < 10 &&
    rel->rd_rel->relpages == 0 &&
    !rel->rd_rel->relhassubclass)
    curpages = 10;

However, even with this overestimation, the planner still struggles with NOT EXISTS subqueries when the target table is empty. The planner assumes that with zero or very few rows, a sequential scan would be more efficient than an index scan. But when the actual data volume is large (as in your 500k rows), this assumption becomes incorrect.

The planner’s decision-making process is influenced by several factors:

  • The number of rows in the table (reltuples)
  • The number of blocks (relpages)
  • The distribution of values in columns
  • The presence of indexes
  • The specific query structure

When these statistics are inaccurate or missing (as with empty tables), the planner may choose an execution plan that is far from optimal for your actual data volume.

How Seeding One Row Fixes PostgreSQL Insert Performance

The reason why inserting just one row before your bulk operation dramatically improves performance (from >5 minutes to ~5 seconds) is that this single row updates the table statistics, giving the planner accurate information to work with.

When you insert that first row, the table gains its first heap page. The planner does not rely solely on the (possibly stale) pg_class entry: at plan time it checks the relation’s actual size on disk and extrapolates the row count from its block count. This seemingly trivial change has profound implications for how the planner executes subsequent queries.

With updated statistics, the planner can:

  1. Recognize that an index scan would be more efficient than a sequential scan for the NOT EXISTS subquery
  2. Choose a better join strategy for the INSERT operation
  3. Make more accurate cost estimates for foreign key validation

The PostgreSQL documentation explains that ANALYZE updates the planner’s statistics by scanning the table and computing the number of rows, number of blocks, and various statistics about column values. Inserting even a single row achieves a similar effect, albeit with less comprehensive statistics.
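A minimal way to watch this on your own table (a sketch, assuming id is the only required column of the table_name placeholder):

sql
-- On PostgreSQL 14+, a never-analyzed table shows reltuples = -1
SELECT reltuples FROM pg_class WHERE relname = 'table_name';

INSERT INTO table_name (id) VALUES (1);  -- seed one row
ANALYZE table_name;                      -- recompute statistics

-- The estimate now reflects real data
SELECT reltuples FROM pg_class WHERE relname = 'table_name';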

This explains why your experience with seeding just one row is so effective—it provides just enough statistical information for the planner to make better decisions. As one PostgreSQL developer noted: “The key insight is that empty tables cause the planner to assume seq scans for FK checks, but one row is enough to enable index scans.”

The difference in performance is stark because:

  • Empty table: Planner chooses seq scan → nested loop join → O(n*m) complexity
  • Seeded table: Planner chooses index scan → hash join → O(n+m) complexity

This algorithmic difference explains the orders-of-magnitude performance improvement you’re seeing.

NOT EXISTS vs. ON CONFLICT: Why the Switch Resolved Your Issue

Your update that switching to ON CONFLICT (id) DO NOTHING resolved the performance issue is a perfect illustration of how different SQL constructs can lead to vastly different execution plans, even when they appear semantically equivalent.

Let’s compare the two approaches:

NOT EXISTS Approach

sql
INSERT INTO table_name (id, field1, field2...) 
SELECT id, field1, field2... 
FROM temp_sync_table_name src 
WHERE NOT EXISTS (SELECT 1 FROM table_name dest WHERE dest.id = src.id);

ON CONFLICT Approach

sql
INSERT INTO table_name (id, field1, field2...) 
SELECT id, field1, field2... 
FROM temp_sync_table_name src 
ON CONFLICT (id) DO NOTHING;

The key difference is how PostgreSQL handles the duplicate check:

  1. NOT EXISTS: This requires a subquery that executes for each row in the source table. When the target table is empty, the planner defaults to sequential scans for this subquery, leading to O(n*m) complexity where n is the number of source rows and m is the number of target rows.

  2. ON CONFLICT: This uses the unique index (or constraint) you specify to detect duplicates at execution time. Each row is checked with a direct index probe (PostgreSQL’s speculative-insertion machinery), so correctness and speed do not depend on the planner picking a good plan for a subquery.

The ON CONFLICT approach benefits from:

  • Better optimization by the planner due to its specialized syntax
  • Potential use of index-only scans when checking for conflicts
  • Avoidance of the subquery execution overhead
  • More efficient handling of the unique constraint check

As you discovered, this approach works well even on empty tables because it doesn’t depend on the target table having statistics to make good decisions. The conflict resolution mechanism is implemented at a lower level than the query planner’s subquery optimization.

Your experience aligns with PostgreSQL best practices that recommend ON CONFLICT for upsert operations over manual NOT EXISTS checks, especially for bulk operations.
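One practical prerequisite worth noting: ON CONFLICT (id) only works if a unique index or constraint exists on id. A sketch, with the question’s table_name placeholder (either statement alone is sufficient; a primary key creates its own unique index):

sql
-- Either a primary key...
ALTER TABLE table_name ADD PRIMARY KEY (id);

-- ...or a plain unique index is enough for ON CONFLICT (id)
CREATE UNIQUE INDEX IF NOT EXISTS table_name_id_uniq ON table_name (id);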

Foreign Key Overhead During Bulk Inserts in PostgreSQL

Foreign key constraints add significant overhead to INSERT operations in PostgreSQL, and this overhead becomes particularly problematic during bulk inserts into empty tables. Understanding this relationship is crucial for optimizing database synchronization processes.

When you insert rows into a table with foreign key constraints, PostgreSQL must:

  1. Check each inserted value against the referenced table
  2. Verify that the referenced values exist
  3. Roll back the entire operation if any constraint violation occurs

For your specific scenario, the foreign key constraints reference “already-populated tables,” which means the validation process needs to access these other tables for every row being inserted.

The performance issues are compounded by several factors:

1. Empty Table Statistics Problem

As we’ve discussed, empty tables cause the planner to choose sequential scans for the foreign key validation checks. When the target table is empty, the planner assumes there are few rows, so sequential scans seem efficient. But this assumption breaks down during bulk operations.

2. Join Strategy Inefficiency

The NOT EXISTS subquery combined with foreign key validation creates nested loop joins that become inefficient with large datasets. Each row from the source table potentially triggers a scan of the target table to check both the NOT EXISTS condition and the foreign key constraints.

3. Index Selection

Foreign key constraints typically require indexes on the referenced columns. However, when the target table is empty, the planner may not choose these indexes for validation checks, leading to full table scans.
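Related point: PostgreSQL requires a unique index on the referenced side of a foreign key but does not automatically index the referencing column. If your sync also deletes or updates rows in the referenced tables, that missing index can dominate. A sketch with hypothetical table and column names (child.parent_id referencing parent.id):

sql
-- parent(id) is already indexed (it backs the primary key);
-- the referencing column must be indexed explicitly:
CREATE INDEX IF NOT EXISTS child_parent_id_idx ON child (parent_id);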

According to PostgreSQL experts: “Foreign keys can slow inserts by up to 180x when the planner chooses inefficient execution plans due to missing statistics.”

To isolate which foreign key constraints are causing the most trouble, you can test them individually by:

  1. Temporarily disabling constraints with ALTER TABLE ... DROP CONSTRAINT
  2. Running bulk inserts without the constraints
  3. Identifying the specific constraint that causes the most performance degradation

This approach allows you to focus optimization efforts on the problematic relationships rather than trying to optimize all foreign keys uniformly.
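Before dropping anything, you can enumerate the candidate constraints from the system catalogs (table_name as before):

sql
-- List all foreign key constraints on the table, with definitions
SELECT conname, pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE conrelid = 'table_name'::regclass
  AND contype = 'f';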

Best Practices for Bulk Data Loading and Sync in PostgreSQL 16

Based on your scenario and PostgreSQL best practices, here are several strategies to optimize bulk data loading and synchronization operations, especially when dealing with foreign key constraints:

1. Use COPY Instead of INSERT

For very large datasets, consider using PostgreSQL’s COPY command instead of INSERT statements:

sql
-- Create temp table without constraints (a plain LIKE copies column
-- definitions and not-null constraints, but no indexes, defaults,
-- or foreign keys)
CREATE TEMP TABLE temp_sync_table_name (LIKE table_name);

-- Load data efficiently using server-side COPY (the file must be
-- readable by the server process; use psql's \copy for client-side files)
COPY temp_sync_table_name FROM '/path/to/data.csv' WITH (FORMAT CSV);

-- Then use INSERT...ON CONFLICT
INSERT INTO table_name SELECT * FROM temp_sync_table_name 
ON CONFLICT (id) DO NOTHING;

2. Temporarily Disable Constraints

For maximum performance during bulk loads, consider temporarily disabling foreign key constraints:

sql
-- Before bulk load
ALTER TABLE table_name DROP CONSTRAINT IF EXISTS constraint_name;

-- Perform bulk load
INSERT INTO table_name SELECT * FROM temp_sync_table_name;

-- After bulk load
ALTER TABLE table_name ADD CONSTRAINT constraint_name 
FOREIGN KEY (referenced_column) REFERENCES referenced_table(id);

This approach is particularly effective for database synchronization processes where you can guarantee data integrity through other means (like validating data before insertion).
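If a full drop-and-recreate feels too risky, a middle ground is to re-add the constraint as NOT VALID and validate it afterwards; VALIDATE CONSTRAINT still scans the table but takes a weaker lock than a validating ADD CONSTRAINT. A sketch with this section’s placeholder names:

sql
-- Re-add without scanning existing rows (new rows are still checked)
ALTER TABLE table_name ADD CONSTRAINT constraint_name
  FOREIGN KEY (referenced_column) REFERENCES referenced_table(id)
  NOT VALID;

-- Validate existing rows later, under a lighter lock
ALTER TABLE table_name VALIDATE CONSTRAINT constraint_name;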

3. Optimize Configuration Parameters

Adjust PostgreSQL configuration parameters for bulk loading:

sql
-- Set for the current session
SET maintenance_work_mem = '2GB'; -- Larger for index creation
SET max_parallel_workers_per_gather = 0; -- Disable parallelism for bulk loads
SET work_mem = '128MB'; -- Sufficient for sorting operations

-- Consider increasing max_wal_size to space out checkpoints during
-- very large loads (checkpoint_segments was removed in PostgreSQL 9.5)

4. Use Transaction Management

For your sync process, consider breaking large operations into smaller transactions:

sql
-- Process in batches to avoid long-running transactions. Transaction
-- control (COMMIT) is allowed inside a DO block when it is not run
-- within an outer transaction (PostgreSQL 11+).
DO $$
BEGIN
  FOR i IN 1..1000 LOOP
    INSERT INTO table_name
    SELECT * FROM temp_sync_table_name
    WHERE id BETWEEN (i - 1) * 1000 AND i * 1000 - 1
    ON CONFLICT (id) DO NOTHING;

    COMMIT;
  END LOOP;
END $$;

This approach helps manage resource usage and provides recovery points if the process fails.

5. Order Operations by Dependencies

Since you mentioned dependency ordering in your sync process, ensure you’re loading:

  1. Tables with no foreign keys first (independent tables)
  2. Tables that reference the first set next
  3. Tables with circular references last

This ordering minimizes foreign key validation overhead during the sync process.
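The dependency graph itself can be read out of the catalogs, which is handy for generating the load order automatically:

sql
-- Each row is one FK edge: load 'parent' before 'child'
SELECT conrelid::regclass  AS child,
       confrelid::regclass AS parent
FROM pg_constraint
WHERE contype = 'f';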

6. Consider Using Temporary Tables

As you’re already doing, using temporary tables for intermediate storage is an excellent strategy. Ensure these tables are created with appropriate indexes to speed up the final INSERT operations.

The official PostgreSQL documentation on bulk loading recommends these approaches, noting: “For best performance in bulk loading, drop indexes and foreign key constraints, load data, then recreate indexes and constraints.”

Optimizing Temp Table Syncs with LibPQ and Pipeline Mode

Your current approach using temporary tables and libpq’s pipeline mode is well-suited for database synchronization operations. Let’s explore how to optimize this process further:

Pipeline Mode Benefits

Using pipeline mode with libpq allows you to send multiple commands without waiting for each one to complete, which significantly reduces network overhead. This is particularly valuable for bulk operations across the network.

c
// Example of pipeline mode in libpq (PostgreSQL 14+)
if (PQenterPipelineMode(conn) != 1)
    fprintf(stderr, "could not enter pipeline mode: %s", PQerrorMessage(conn));

// Queue multiple commands without waiting for results
PQsendQueryParams(conn, "BEGIN", 0, NULL, NULL, NULL, NULL, 0);
PQsendQueryParams(conn, "CREATE TEMP TABLE temp_sync...", 0, NULL, NULL, NULL, NULL, 0);
// ... more commands
PQsendQueryParams(conn, "COMMIT", 0, NULL, NULL, NULL, NULL, 0);

// Mark a synchronization point and flush the queue
PQpipelineSync(conn);

// Consume results with PQgetResult() until PGRES_PIPELINE_SYNC,
// then leave pipeline mode
PQexitPipelineMode(conn);

Multithreaded Temp Table Creation

Your mention of multithreaded temp table creation via PQputCopyData is excellent for performance. To further optimize:

  1. Batch Processing: Divide your source data into chunks and process them in parallel threads
  2. Connection Pooling: Use multiple connections to the target database
  3. Result Coordination: Ensure threads don’t process overlapping data ranges
c
// Per-thread batch loader: each thread must use its OWN connection.
// The COPY command itself is issued with PQexec; PQputCopyData then
// streams the data rows, and PQputCopyEnd terminates the stream.
void process_batch(PGconn *conn, const char *csv_rows, size_t len) {
    PGresult *res = PQexec(conn,
        "COPY temp_sync_table_name FROM STDIN WITH (FORMAT CSV)");
    if (PQresultStatus(res) != PGRES_COPY_IN) {
        fprintf(stderr, "Error starting COPY: %s", PQerrorMessage(conn));
        PQclear(res);
        return;
    }
    PQclear(res);

    // Stream this thread's rows (CSV text for its id range)
    if (PQputCopyData(conn, csv_rows, (int) len) != 1)
        fprintf(stderr, "Error sending COPY data: %s", PQerrorMessage(conn));

    if (PQputCopyEnd(conn, NULL) != 1)
        fprintf(stderr, "Error ending COPY: %s", PQerrorMessage(conn));

    // Drain the final result of the COPY
    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);
}

Single-Threaded Sync Optimization

For the final sync step (single-threaded, single transaction), consider these optimizations:

  1. Batched INSERTs: Process temp table data in smaller batches
  2. Pre-Sort Data: Sort data in the temp table to match index order
  3. Memory Management: Adjust work_mem for large sort operations
sql
-- Process in batches within the transaction
BEGIN;

-- Process 10,000 rows at a time
INSERT INTO table_name 
SELECT * FROM temp_sync_table_name 
WHERE id BETWEEN 1 AND 10000
ON CONFLICT (id) DO NOTHING;

INSERT INTO table_name 
SELECT * FROM temp_sync_table_name 
WHERE id BETWEEN 10001 AND 20000
ON CONFLICT (id) DO NOTHING;

-- Continue in batches...
COMMIT;

Error Handling and Recovery

For production sync processes, implement robust error handling:

  1. Savepoints: Use savepoints within large transactions
  2. Retry Logic: Implement retry mechanisms for transient errors
  3. Logging: Detailed logging for troubleshooting
sql
BEGIN;
SAVEPOINT batch_insert;

-- Attempt batch insert
INSERT INTO table_name ...
ON CONFLICT (id) DO NOTHING;

-- On error: ROLLBACK TO SAVEPOINT batch_insert; then retry or skip
-- On success: release the savepoint and move on
RELEASE SAVEPOINT batch_insert;
COMMIT;

These optimizations align with recommendations from PostgreSQL experts for high-performance database synchronization, particularly when dealing with large datasets and foreign key constraints.

When to Use ANALYZE and Alternatives for PostgreSQL Optimization

You mentioned that someone suggested running ANALYZE table_name beforehand to update planner statistics. Let’s examine when this is appropriate and what alternatives exist.

ANALYZE as a Solution

Running ANALYZE on an empty table does update the planner statistics, but with limited effectiveness:

sql
-- Updates statistics but with minimal data to work with
ANALYZE table_name;

The problem is that ANALYZE on an empty table still results in reltuples=0 (or very close to it), which doesn’t provide enough information for the planner to make good decisions for large bulk operations.

Seeding vs. ANALYZE

As you discovered, seeding a single row is more effective than ANALYZE for this specific scenario because:

  1. Seeding: Actually inserts data, giving the planner accurate row count and storage statistics
  2. ANALYZE on empty: Updates statistics but still shows minimal data (0 or 1 row)

For your use case, the single-row seeding approach is the better quick fix. However, for production environments, there are more robust approaches:

Alternative Approaches

1. Manual Statistics Override

PostgreSQL (through version 16) offers no supported ALTER TABLE option for overriding planner row estimates. The only direct route is editing the system catalog, which is superuser-only, risky, and overwritten by the next VACUUM or ANALYZE; treat it as a diagnostic hack rather than a production fix:

sql
-- Superuser-only; not recommended outside testing
UPDATE pg_class
SET reltuples = 500000, relpages = 5000
WHERE oid = 'table_name'::regclass;

2. Use a Different INSERT Pattern

Consider rewriting your INSERT to avoid the NOT EXISTS pattern:

sql
-- Alternative using JOIN instead of subquery
INSERT INTO table_name (id, field1, field2...)
SELECT src.id, src.field1, src.field2...
FROM temp_sync_table_name src
LEFT JOIN table_name dest ON dest.id = src.id
WHERE dest.id IS NULL;

This approach sometimes generates better execution plans.

3. Use EXPLAIN to Verify Plans

Before running large operations, always check the execution plan:

sql
-- Note: EXPLAIN ANALYZE actually executes the statement, so wrap a
-- destructive command in a transaction and roll it back
BEGIN;
EXPLAIN ANALYZE
INSERT INTO table_name (id, field1, field2...) 
SELECT id, field1, field2... 
FROM temp_sync_table_name src 
WHERE NOT EXISTS (SELECT 1 FROM table_name dest WHERE dest.id = src.id);
ROLLBACK;

Look for:

  • Sequential vs. index scans
  • Nested loop vs. hash joins
  • Estimated vs. actual row counts
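If the plan shows a sequential scan where you expect an index scan, you can check the alternative plan’s cost in the same session by temporarily discouraging seq scans (a diagnostic toggle only, never a production setting):

sql
SET enable_seqscan = off;  -- discourage seq scans for this session

EXPLAIN (ANALYZE, BUFFERS)
SELECT 1 FROM table_name WHERE id = 42;

RESET enable_seqscan;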

4. Consider Using pg_repack

For very large tables, consider the pg_repack extension, which can rebuild bloated tables and indexes with minimal locking. Note that it addresses bloat accumulated by repeated syncs rather than the empty-table statistics problem, and it is driven by a command-line client, not a SQL function:

sh
# Requires the pg_repack extension and its client binary
pg_repack --table=table_name dbname

Production Considerations

For production database synchronization processes, the ANALYZE approach is generally not recommended because:

  1. It doesn’t solve the fundamental problem of empty table statistics
  2. It adds an extra step to your synchronization process
  3. It’s less reliable than structural solutions like ON CONFLICT

The best practice is to design your sync process to be robust regardless of table statistics, which is why the ON CONFLICT approach you discovered is superior for production use.

In short: ANALYZE is useful for keeping statistics current, but it is not a substitute for a structural fix when the root cause is an empty table’s missing statistics.


Sources

  1. PostgreSQL General Mailing List — Discussion on foreign keys and slow insert performance: https://pgsql-general.postgresql.narkive.com/LbHN5rgJ/foreign-keys-and-slow-insert

  2. PostgreSQL Developer Commentary — Empty table statistics and planner behavior explanation: https://www.postgresql.org/message-id/d7qo6e$fq$1@news.hub.org

  3. Stack Overflow Question — Exact match case of slow inserts into fresh tables with FKs: https://stackoverflow.com/questions/79866394/slow-inserts-into-a-freshly-created-postgres-table-with-foreign-keys

  4. PostgreSQL Planner Source Code — Details on empty table overestimation in planner: https://www.postgresql.org/message-id/87k1sk95f2.fsf@news-spur.riddles.org.uk

  5. PostgreSQL Official Documentation — Bulk loading guide and best practices: https://www.postgresql.org/docs/16/populate.html

  6. PostgreSQL Statistics Documentation — Detailed explanation of planner statistics and reltuples: https://www.postgresql.org/docs/current/planner-stats.html

  7. Stack Overflow Performance Q&A — Foreign key impact on insert performance: https://stackoverflow.com/questions/1472446/postgresql-foreign-keys-insert-speed-django

  8. PostgreSQL Russian Community — Testing approaches for foreign key performance: https://postgrespro.com/list/thread-id/2028768

  9. TigerData Blog — Practical tips for improving PostgreSQL insert performance: https://www.tigerdata.com/blog/13-tips-to-improve-postgresql-insert-performance

  10. EnterpriseDB Blog — Best practice for bulk data loading in PostgreSQL: https://www.enterprisedb.com/blog/7-best-practice-tips-postgresql-bulk-data-loading


Conclusion

The performance difference you observed with INSERT operations into freshly created PostgreSQL tables with foreign keys is a well-documented phenomenon caused by how the query planner handles statistics for empty tables. When a table is empty, the planner defaults to sequential scans for foreign key validation and NOT EXISTS subqueries, leading to inefficient execution plans with O(n*m) complexity.

Your experience with seeding just one row dramatically improving performance (from >5 minutes to ~5 seconds) demonstrates how critical accurate statistics are for the planner’s decision-making process. However, the more robust solution you discovered—switching to ON CONFLICT (id) DO NOTHING—addresses the fundamental issue by avoiding the inefficient subquery pattern altogether.

For production database synchronization processes, the best practices include using ON CONFLICT for duplicate handling, potentially disabling foreign key constraints during bulk loads, and optimizing configuration parameters like maintenance_work_mem. While ANALYZE can help in some cases, it’s not a reliable solution for empty table performance issues.

By understanding these PostgreSQL internals and implementing the appropriate optimization strategies, you can achieve consistent performance for bulk INSERT operations regardless of whether your tables start empty or already contain data.

Authors
NeuroAnswers