Fast PostGIS Polygon Clustering for Zoom Maps
Optimize large polygon datasets in PostGIS for zoom-based rendering. Use GiST/BRIN indexes, ST_ClusterKMeans, ST_ClusterWithin, vector tiles (MVT), and geometry simplification for millisecond queries on 300k+ records.
How to optimally cluster large polygon datasets in PostGIS for zoom-based map rendering?
In PostGIS/PostgreSQL, I have a table named prediction with about 300,000 records structured as follows:
sub_type: fk to prediction_sub_type table
end_time: timestamp
county: fk to county table
geometry: geometry(polygon, 4326)
I typically filter predictions by county, sub_type, and end_time (specific to each sub_type), then return the data to the frontend for map display. However, the dataset size can reach up to 500MB, causing performance issues.
I need a zoom-based fetching strategy that clusters polygons at low zoom levels while preserving the base shape of the data, with query times in milliseconds.
I tried ST_ClusterDBSCAN combined with ST_ConcaveHull, but it takes over 30 minutes. What are faster alternatives, indexing strategies, or optimization techniques for efficient clustering of large polygon datasets in PostGIS?
For fast PostGIS polygon clustering on large datasets like your 300k-record prediction table, ditch runtime ST_ClusterDBSCAN—it’s computationally brutal on complex polygons. Instead, precompute clusters per zoom level with ST_ClusterKMeans or ST_ClusterWithin, simplify geometries using ST_Simplify, and serve via vector tiles (MVT) for millisecond zoom-based map rendering. Pair this with GiST spatial indexes and BRIN for geohash-ordered tables to slash query times while filtering by county, sub_type, and end_time.
Contents
- Why Naive Clustering Fails
- Key Indexing Strategies
- Precompute Vector Tiles by Zoom
- Faster Clustering Alternatives
- Geometry Simplification Tricks
- Full Pipeline with SQL Examples
- Query Optimization and Monitoring
- Sources
- Conclusion
Why Naive Clustering Fails
Picture this: 300,000 polygons, each potentially complex with dozens of vertices, filtered by county and sub_type, then hit with ST_ClusterDBSCAN and ST_ConcaveHull. No wonder it drags on for 30+ minutes. DBSCAN shines for density-based grouping, but on polygons? Every neighborhood check is an expensive polygon-to-polygon distance computation, and the pairwise work trends toward O(n²) nightmare territory. Add concave hulls, which trace outlines via alpha shapes, and you're compounding expensive geometry ops across the whole dataset.
Your filters help, but without proper indexes, PostGIS still scans too much. Frontend rendering chokes on 500MB payloads too. The fix? Shift to server-side pre-aggregation tuned for zoom levels. At low zooms (say, 1-8), merge into fewer, simplified clusters. Closer in? Reveal more detail. PostGIS-generated vector tiles make this feasible.
And here’s a reality check: real-world maps from Mapbox or Zalando don’t compute clusters live. They pre-tile everything.
Key Indexing Strategies
Indexes aren’t optional—they’re your first line of defense. Start with a GiST spatial index on geometry:
CREATE INDEX idx_prediction_geom ON prediction USING GIST (geometry);
This speeds bbox intersects massively, filtering your county/sub_type/end_time queries before geometry math kicks in. But for millions of rows? Layer on BRIN for cheap, block-level indexing after sorting by a geohash:
-- Add geohash column (for polygons, hashes the bounding box)
ALTER TABLE prediction ADD COLUMN geohash text;
UPDATE prediction SET geohash = ST_GeoHash(geometry, 20);
-- Physically reorder the table by geohash so BRIN block ranges stay tight
CREATE INDEX idx_prediction_geohash ON prediction (geohash);
CLUSTER prediction USING idx_prediction_geohash;
-- BRIN index (super fast to build) on the now-ordered column
CREATE INDEX idx_prediction_brin ON prediction USING BRIN (geohash);
Why geohash? It linearizes 2D space into 1D strings, perfect for BRIN's block scans on static-ish data like predictions. The PostGIS performance tips swear by CLUSTER for cache-friendly layouts: ordered tables mean queries touch fewer pages. SP-GiST suits point-like access patterns too, but GiST rules for polygons.
Test with EXPLAIN ANALYZE. You’ll see index scans dominate.
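For example, a quick sanity check (the bbox literal here is illustrative):
-- Confirm the planner uses the GiST index for a viewport-style query
EXPLAIN ANALYZE
SELECT count(*)
FROM prediction
WHERE geometry && ST_MakeEnvelope(-105.0, 39.5, -104.5, 40.0, 4326)
  AND county = 1;
Look for an Index Scan or Bitmap Index Scan on idx_prediction_geom in the plan; a Seq Scan means the index isn't being used.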
Precompute Vector Tiles by Zoom
Zoom-based rendering screams for vector tiles. Generate MVTs per zoom (e.g., 0-14) with pre-clustered, simplified polygons. No live computation—queries fetch ready tiles intersecting the viewport.
Use ST_AsMVT in a script or pg_tileserv/TileServer GL:
-- Example tile query (z/x/y supplied by your tile server; PostGIS 3.0+)
SELECT ST_AsMVT(tile, 'predictions', 4096, 'geom') AS mvt
FROM (
    SELECT ST_AsMVTGeom(ST_Transform(ST_Union(geom), 3857),
                        ST_TileEnvelope(z, x, y), 4096, 64, true) AS geom,
           count(*) AS cnt
    FROM (
        SELECT geometry AS geom,
               ST_ClusterKMeans(geometry, 50) OVER () AS cid -- ~50 clusters per tile; k must not exceed row count
        FROM prediction
        WHERE geometry && ST_Transform(ST_TileEnvelope(z, x, y), 4326)
          AND county = ? AND sub_type = ? AND end_time > ?
    ) pts
    GROUP BY cid
) AS tile
WHERE geom IS NOT NULL;
Batch-generate with tools like Tippecanoe or clusterbuster, which iterates ST_ClusterDBSCAN from max zoom down. Store in PostGIS or PMTiles for single-file serving. Mapbox's GeoJSON-VT proves on-the-fly tiling can work, but precomputing wins at your scale.
Result? Frontend requests a tile by zoom/coords—boom, milliseconds.
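If you serve through pg_tileserv, its function-layer convention (z/x/y integer parameters, bytea return) wraps the same query; a minimal sketch, with the county/sub_type/end_time filters omitted for brevity:
-- Sketch of a pg_tileserv function layer: one MVT per z/x/y request
CREATE OR REPLACE FUNCTION predictions_tile(z integer, x integer, y integer)
RETURNS bytea AS $$
  SELECT ST_AsMVT(tile, 'predictions', 4096, 'geom')
  FROM (
    SELECT ST_AsMVTGeom(ST_Transform(geometry, 3857),
                        ST_TileEnvelope(z, x, y), 4096, 64, true) AS geom,
           sub_type
    FROM prediction
    WHERE geometry && ST_Transform(ST_TileEnvelope(z, x, y), 4326)
  ) AS tile
  WHERE geom IS NOT NULL;
$$ LANGUAGE SQL STABLE;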
Faster Clustering Alternatives
ST_ClusterDBSCAN is out. Try these from PostGIS clustering docs:
- ST_ClusterWithin(geom, distance): aggregate equivalent of DBSCAN with minpoints = 0; groups anything within the tolerance, with no density minimum to tune.
- ST_ClusterKMeans(geom, k): fixed k clusters (e.g., 20-100 per zoom). Centroid-based and scales roughly linearly. Great for even distribution; the Crunchy Data example shows it handling global data.
- Grid binning as fallback: snap to a hex grid or ST_SnapToGrid cells, then ST_Union per bin (see the sketch after the KMeans example).
-- KMeans example: the window function assigns ids; aggregate in an outer query
SELECT county, cluster_id, ST_Centroid(ST_Collect(geom)) AS rep_geom
FROM (SELECT county, geometry AS geom,
             ST_ClusterKMeans(geometry, 25) OVER (PARTITION BY county) AS cluster_id
      FROM prediction WHERE ...) sub -- your sub_type/end_time filters
GROUP BY county, cluster_id;
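And the grid-binning fallback from the list above (a sketch; the 0.01° cell size is illustrative):
-- Grid binning: snap centroids to a coarse grid, then union per cell
SELECT ST_SnapToGrid(ST_Centroid(geometry), 0.01) AS cell,
       ST_Union(geometry) AS merged_geom,
       count(*) AS cnt
FROM prediction
GROUP BY cell;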
Tune k dynamically: fewer clusters at low zoom, more at high. Endpoint Dev's regionating post benchmarks KMeans and DBSCAN; both beat desktop tools.
Geometry Simplification Tricks
Polygons kill perf with vertices. Simplify first:
-- Douglas-Peucker, topology-preserving; keep originals in geometry for hulls later
ALTER TABLE prediction ADD COLUMN geom_simple geometry(Polygon, 4326);
UPDATE prediction
SET geom_simple = ST_SimplifyPreserveTopology(geometry, 0.0001); -- ~10m tolerance
Or subdivide beasts:
-- ST_Subdivide is set-returning, so store the pieces in a side table (assumes an id PK)
CREATE TABLE prediction_parts AS
SELECT id, county, sub_type, end_time, ST_Subdivide(geometry, 8) AS part_geom -- max 8 vertices per piece
FROM prediction;
Cluster on the simplified or subdivided geometries; hull on the originals only if needed. Crunchy Data's DBSCAN post visualizes how this reveals clusters without overload.
Low zooms? Aggressive tolerance (0.001°). High? None.
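You can encode that as a small helper; a sketch with illustrative constants (the generate_clusters function below assumes something like it exists as tolerance(z)):
-- Hypothetical zoom -> simplification tolerance in degrees, halving per zoom
CREATE OR REPLACE FUNCTION tolerance(z integer)
RETURNS float8 AS $$
  SELECT 0.1 / (2 ^ z); -- z=0 -> 0.1 deg, z=10 -> ~0.0001 deg
$$ LANGUAGE SQL IMMUTABLE;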
Full Pipeline with SQL Examples
- Index and simplify (one-time).
- Script per-zoom clustering: Python + psycopg2 or PL/Python, partition by county/sub_type.
- Generate MVTs: loop zooms, query bbox-clustered data, ST_AsMVT (see the loop sketch after this list).
- Serve: pg_tileserv or a PostgREST endpoint like /tiles/{z}/{x}/{y}.mvt?county=1&sub_type=foo&end_time=now.
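Here's that loop as pure SQL, assuming the predictions_tile wrapper from the vector-tiles section and a tiles cache table (a real job would skip tiles outside your data extent):
-- Hypothetical tile cache; predictions_tile(z,x,y) is the wrapper defined earlier
CREATE TABLE IF NOT EXISTS tiles (z int, x int, y int, mvt bytea, PRIMARY KEY (z, x, y));
DO $$
BEGIN
  FOR z IN 0..8 LOOP -- precompute low zooms only; high zooms stay live
    FOR x IN 0..(2 ^ z)::int - 1 LOOP
      FOR y IN 0..(2 ^ z)::int - 1 LOOP
        INSERT INTO tiles VALUES (z, x, y, predictions_tile(z, x, y))
        ON CONFLICT (z, x, y) DO UPDATE SET mvt = EXCLUDED.mvt;
      END LOOP;
    END LOOP;
  END LOOP;
END $$;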
Sample precompute func:
CREATE OR REPLACE FUNCTION generate_clusters(z integer)
RETURNS TABLE(cluster_geom geometry, props jsonb) AS $$
  SELECT
    ST_Union(ST_SimplifyPreserveTopology(geom, tolerance(z))) AS cluster_geom,
    jsonb_build_object('count', count(*), 'sub_type', sub_type) AS props
  FROM (
    SELECT geometry AS geom, county, sub_type,
           ST_ClusterKMeans(geometry, LEAST(100, (2 ^ z)::int))
               OVER (PARTITION BY county, sub_type) AS cid
    FROM prediction
    WHERE end_time > NOW() - INTERVAL '1 day' -- your filter
  ) c
  GROUP BY county, sub_type, cid; -- cluster ids restart per partition
$$ LANGUAGE SQL;
Pipe to tiles. Zalando’s map post nails the grid+zoom logic.
Query Optimization and Monitoring
Live queries? Always bbox + indexes:
SELECT * FROM prediction
WHERE ST_Intersects(geometry, ST_MakeEnvelope(xmin, ymin, xmax, ymax, 4326)) -- viewport bounds
  AND county = 1 AND sub_type = 2 AND end_time > '2025-01-01'
ORDER BY end_time DESC;
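The attribute filters deserve their own B-tree as well; a sketch (the column order assumes county is the most selective filter):
-- Composite index covering the non-spatial filters
CREATE INDEX idx_prediction_filters ON prediction (county, sub_type, end_time DESC);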
VACUUM ANALYZE post-loads. Watch EXPLAIN for seq scans. Cache tiles in Redis. For 300k rows, expect <100ms post-optimizations per PostGIS workshop.
Scale issue? Shard by county.
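Native list partitioning by county gets you most of that benefit without true sharding; a sketch with assumed column types:
-- Hypothetical list-partitioned layout, one partition per county
CREATE TABLE prediction_partitioned (
    sub_type int,
    end_time timestamp,
    county   int,
    geometry geometry(Polygon, 4326)
) PARTITION BY LIST (county);
CREATE TABLE prediction_county_1 PARTITION OF prediction_partitioned FOR VALUES IN (1);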
Sources
- ST_ClusterDBSCAN - PostGIS Docs
- ST_ClusterWithin - PostGIS Docs
- ST_ClusterKMeans - PostGIS Docs
- Clustering on Indices - PostGIS Workshop
- Performance Tips - PostGIS Docs
- PostGIS Clustering with DBSCAN - Crunchy Data
- PostGIS Clustering with K-Means - Crunchy Data
- Regionating with PostGIS - Endpoint Dev
- Rendering Big Geodata - Mapbox Blog
- Maps with PostgreSQL and PostGIS - Zalando
- Clusterbuster GitHub
Conclusion
Nail PostGIS polygon clustering for zoom-based maps by precomputing simplified, KMeans-clustered MVT tiles per zoom, backed by GiST/BRIN indexes and geohash ordering. Skip DBSCAN runtime pain—your 300k polygons will render in milliseconds, filters intact. Test small, scale up, and watch those 30-minute waits vanish. Dive into the SQL, tweak tolerances, and your maps will fly.