How to group by geography in BigQuery while preserving the geography column?
I have the following BigQuery SQL code:
SELECT
h3s.h3id, h3s.geog,
MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)))
OVER (PARTITION BY h3s.h3id)
FROM
rgns_h3_geogs_clipped h3s
CROSS JOIN
hotspot_centers htsp
I want to group by h3s.h3id while keeping the geography column. When I try to remove the OVER clause and use GROUP BY h3id, geog instead, I get this error:
Grouping by expressions of type GEOGRAPHY is not allowed
How can I perform the grouping operation in BigQuery while preserving the geography column?
How to Group by Geography in BigQuery While Preserving the Geography Column
Brief Answer:
You can’t directly group by GEOGRAPHY columns in BigQuery, but there are several workarounds. The most straightforward approach is to use a subquery or CTE to calculate your aggregations first, then join back with the original table to preserve the geography column. Alternatively, you can keep using window functions as in your original query if that produces the desired results.
Navigation Structure
- Why BigQuery Doesn’t Allow Grouping by Geography
- Solution 1: Using Window Functions (Original Approach)
- Solution 2: Using a Subquery with GROUP BY
- Solution 3: Using ARRAY_AGG for Geography Preservation
- Best Practices for Geospatial Aggregations in BigQuery
- Conclusion
Why BigQuery Doesn’t Allow Grouping by Geography
BigQuery does not let you group directly by GEOGRAPHY values. Its documentation lists GEOGRAPHY among the data types that are not groupable, so the type cannot appear in GROUP BY, DISTINCT, or PARTITION BY. A plausible reason is that two geography objects can describe the same shape with a different vertex order or encoding, which makes equality expensive to test; spatial equality is checked explicitly with ST_EQUALS instead. This is why GROUP BY h3id, geog fails with the error “Grouping by expressions of type GEOGRAPHY is not allowed”.
This limitation doesn’t mean you can’t work with geography in aggregations; you just need to use alternative approaches to achieve similar results.
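One such alternative, sketched here under assumptions, is to group by a serialized form of the geography rather than the geography itself. ST_ASBINARY returns WKB as BYTES, which is groupable, and ST_GEOGFROMWKB converts it back. Table and column names come from the question; the pairs CTE and the geog_wkb and dist aliases are illustrative only.

```sql
WITH pairs AS (
  SELECT
    h3s.h3id,
    ST_ASBINARY(h3s.geog) AS geog_wkb,  -- BYTES, so it may appear in GROUP BY
    ST_DISTANCE(
      `carto-os`.carto.H3_CENTER(htsp.h3id),
      `carto-os`.carto.H3_CENTER(h3s.h3id)
    ) AS dist
  FROM rgns_h3_geogs_clipped AS h3s
  CROSS JOIN hotspot_centers AS htsp
)
SELECT
  h3id,
  ST_GEOGFROMWKB(geog_wkb) AS geog,  -- rebuild the GEOGRAPHY after grouping
  MIN(dist) AS min_distance
FROM pairs
GROUP BY h3id, geog_wkb
```

Grouping here is byte-wise: two geographies that are spatially equal but encoded differently would land in separate groups, so this fits best when each h3id carries exactly one stored shape.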
Solution 1: Using Window Functions (Original Approach)
Your original approach using window functions is actually quite suitable for this case:
SELECT
  h3s.h3id,
  h3s.geog,
  MIN(ST_DISTANCE(
    `carto-os`.carto.H3_CENTER(htsp.h3id),
    `carto-os`.carto.H3_CENTER(h3s.h3id)
  )) OVER (PARTITION BY h3s.h3id) AS min_distance
FROM
  rgns_h3_geogs_clipped h3s
CROSS JOIN
  hotspot_centers htsp
This approach:
- Preserves all rows from your cross join
- Adds a new column with the minimum distance for each h3id
- Keeps the geography column intact
- Avoids the grouping limitation entirely
The window function calculates the minimum distance within each partition (defined by h3s.h3id). The value matches what a GROUP BY would compute, but the rows are not collapsed: each h3id still appears once per row of hotspot_centers, with the same min_distance repeated. SELECT DISTINCT cannot remove these duplicates, because DISTINCT is subject to the same GEOGRAPHY restriction as GROUP BY.
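Because the window function keeps every row of the cross join, each h3id comes back once per hotspot. If one row per h3id is wanted without a GROUP BY, a QUALIFY filter on ROW_NUMBER is one option. The following is a minimal sketch using the question's tables; it keeps the row for the nearest hotspot, and if several rows share an h3id it keeps only one of them.

```sql
SELECT
  h3s.h3id,
  h3s.geog,
  ST_DISTANCE(
    `carto-os`.carto.H3_CENTER(htsp.h3id),
    `carto-os`.carto.H3_CENTER(h3s.h3id)
  ) AS min_distance
FROM rgns_h3_geogs_clipped AS h3s
CROSS JOIN hotspot_centers AS htsp
WHERE TRUE  -- harmless; QUALIFY has at times required an accompanying WHERE, GROUP BY, or HAVING
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY h3s.h3id
  -- rank hotspots by distance so that row 1 is the closest one
  ORDER BY ST_DISTANCE(
    `carto-os`.carto.H3_CENTER(htsp.h3id),
    `carto-os`.carto.H3_CENTER(h3s.h3id)
  )
) = 1
```

The partition key is h3id, not geog, so the GEOGRAPHY restriction never comes into play.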
Solution 2: Using a Subquery with GROUP BY
If you specifically need to use GROUP BY, you can separate the aggregation from the geography selection:
WITH min_distances AS (
  SELECT
    h3s.h3id,
    MIN(ST_DISTANCE(
      `carto-os`.carto.H3_CENTER(htsp.h3id),
      `carto-os`.carto.H3_CENTER(h3s.h3id)
    )) AS min_distance
  FROM
    rgns_h3_geogs_clipped h3s
  CROSS JOIN
    hotspot_centers htsp
  GROUP BY
    h3s.h3id
)
SELECT
  h3s.h3id,
  h3s.geog,
  md.min_distance
FROM
  rgns_h3_geogs_clipped h3s
JOIN
  min_distances md ON h3s.h3id = md.h3id
This approach:
- First calculates the minimum distance for each h3id in a CTE
- Then joins this result back with the original table to include the geography column
- Avoids grouping by geography entirely
This method is particularly useful when you need to combine multiple aggregations or when your query logic becomes complex.
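As an illustration of combining several aggregations, the sketch below extends the same pattern with an average distance and a count of nearby hotspots. The CTE names, the extra output columns, and the 1000 m threshold are assumptions made for the example, not part of the original question.

```sql
WITH pairs AS (
  -- one row per (H3 cell, hotspot) pair with the distance in meters
  SELECT
    h3s.h3id,
    ST_DISTANCE(
      `carto-os`.carto.H3_CENTER(htsp.h3id),
      `carto-os`.carto.H3_CENTER(h3s.h3id)
    ) AS dist
  FROM rgns_h3_geogs_clipped AS h3s
  CROSS JOIN hotspot_centers AS htsp
),
stats AS (
  SELECT
    h3id,
    MIN(dist) AS min_distance,
    AVG(dist) AS avg_distance,
    COUNTIF(dist <= 1000) AS hotspots_within_1km  -- arbitrary example threshold
  FROM pairs
  GROUP BY h3id
)
SELECT
  h3s.h3id,
  h3s.geog,
  s.min_distance,
  s.avg_distance,
  s.hotspots_within_1km
FROM rgns_h3_geogs_clipped AS h3s
JOIN stats AS s
  ON h3s.h3id = s.h3id
```

Keeping the distance calculation in its own CTE means the expression is written once and every aggregate reads from the same column.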
Solution 3: Using ARRAY_AGG for Geography Preservation
If an h3id could appear on several rows, each with its own geography (though typically an H3 ID has one corresponding geography), you can collect the geography with ARRAY_AGG and keep a single element:
SELECT
  h3s.h3id,
  -- DISTINCT is not allowed on GEOGRAPHY, so cap the array at one element instead
  ARRAY_AGG(h3s.geog LIMIT 1)[OFFSET(0)] AS geog,
  MIN(ST_DISTANCE(
    `carto-os`.carto.H3_CENTER(htsp.h3id),
    `carto-os`.carto.H3_CENTER(h3s.h3id)
  )) AS min_distance
FROM
  rgns_h3_geogs_clipped h3s
CROSS JOIN
  hotspot_centers htsp
GROUP BY
  h3s.h3id
This approach:
- Collects the geography values for each H3 ID into an array, capped at one element by LIMIT 1
- Extracts that element from the array using OFFSET(0)
- Groups only by the H3 ID, not the geography
- Preserves geography information without directly grouping by it
Note: ARRAY_AGG(DISTINCT h3s.geog) would fail, because DISTINCT is subject to the same GEOGRAPHY restriction as GROUP BY. This solution also assumes that each h3id corresponds to a single geog value. If there are multiple geography values for the same H3 ID, this will arbitrarily pick one.
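When any one geography per h3id is acceptable, ANY_VALUE expresses the same idea more directly than building an array. The sketch below uses the question's tables. The second statement is a hypothetical sanity check for the one-geography-per-h3id assumption; it counts distinct WKB encodings because COUNT(DISTINCT geog) is not permitted.

```sql
-- One row per h3id, carrying an arbitrary geography from that group
SELECT
  h3s.h3id,
  ANY_VALUE(h3s.geog) AS geog,
  MIN(ST_DISTANCE(
    `carto-os`.carto.H3_CENTER(htsp.h3id),
    `carto-os`.carto.H3_CENTER(h3s.h3id)
  )) AS min_distance
FROM rgns_h3_geogs_clipped AS h3s
CROSS JOIN hotspot_centers AS htsp
GROUP BY h3s.h3id;

-- Sanity check: list any h3id that has more than one distinct stored shape
SELECT
  h3id,
  COUNT(DISTINCT ST_ASBINARY(geog)) AS distinct_shapes
FROM rgns_h3_geogs_clipped
GROUP BY h3id
HAVING COUNT(DISTINCT ST_ASBINARY(geog)) > 1;
```

If the check returns no rows, the arbitrary pick is harmless because every candidate is the same shape.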
Best Practices for Geospatial Aggregations in BigQuery
When working with geospatial data in BigQuery, consider these best practices:
- Use H3 IDs for grouping: Since H3 IDs are designed for hierarchical geospatial indexing, they’re ideal for grouping operations.
- Leverage window functions when appropriate: Window functions can often replace complex GROUP BY operations while preserving all original rows.
- Pre-aggregate when possible: If working with large datasets, consider pre-aggregating distances or other calculations in a separate table.
- Be cautious with CROSS JOINs: A cross join returns one row for every pair of input rows, so the result grows with the product of the two table sizes. Consider using joins with conditions or more efficient spatial operations if possible.
- Choose appropriate geography aggregation functions:
  - ST_UNION_AGG: To merge multiple geography objects
  - ST_CENTROID_AGG: To get the centroid of multiple geography objects
  - ST_CONVEXHULL applied to the result of ST_UNION_AGG: To get the convex hull of multiple geography objects (BigQuery has no dedicated convex hull aggregate)
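The geography aggregation functions can be sketched in a single grouped query. This example assumes rgns_h3_geogs_clipped may hold several geog rows per h3id; the output column names are illustrative. The convex hull is obtained by wrapping ST_UNION_AGG in the scalar ST_CONVEXHULL, since I am not aware of a dedicated convex hull aggregate in BigQuery.

```sql
SELECT
  h3id,
  ST_UNION_AGG(geog) AS merged_geog,          -- all shapes for the cell merged into one
  ST_CENTROID_AGG(geog) AS centroid,          -- centroid across the cell's shapes
  ST_CONVEXHULL(ST_UNION_AGG(geog)) AS hull   -- convex hull of the merged shape
FROM rgns_h3_geogs_clipped
GROUP BY h3id
```

Each of these returns a GEOGRAPHY while grouping only by h3id, so none of them triggers the grouping error.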
Conclusion
To group by geography in BigQuery while preserving the geography column:
- Your window function approach is valid and computes the correct minimum, but it returns one row per h3id and hotspot pair rather than one row per h3id
- For GROUP BY requirements, use a subquery or CTE to calculate aggregations first, then join back with the original table
- Avoid directly grouping by GEOGRAPHY columns as BigQuery doesn’t support this operation
- Consider using H3-specific functions that might be more efficient than generic distance calculations
For your specific case, the window function approach preserves the geography column and calculates the correct minimum distance for each H3 ID, repeated on every row of the cross join. If you need one row per H3 ID, the CTE approach is more appropriate.