GLM 4.5 Air

Complete Guide: Group by Geography in BigQuery

Learn how to group by geography in BigQuery while preserving geography columns. Discover workarounds for BigQuery's geography grouping limitation with practical SQL examples and best practices.

Question

How to group by geography in BigQuery while preserving the geography column?

I have the following BigQuery SQL code:

sql
SELECT 
    h3s.h3id, h3s.geog, 
    MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id))) 
    OVER (PARTITION BY h3s.h3id) 
FROM 
    rgns_h3_geogs_clipped h3s
CROSS JOIN 
    hotspot_centers htsp

I want to group by h3s.h3id while keeping the geography column. When I try to remove the OVER clause and use GROUP BY h3id, geog instead, I get this error:

Grouping by expressions of type GEOGRAPHY is not allowed

How can I perform the grouping operation in BigQuery while preserving the geography column?


How to Group by Geography in BigQuery While Preserving the Geography Column

Brief Answer:
You can’t directly group by GEOGRAPHY columns in BigQuery, but there are several workarounds. The most straightforward approach is to use a subquery or CTE to calculate your aggregations first, then join back with the original table to preserve the geography column. Alternatively, you can keep using window functions as in your original query if that produces the desired results.


Why BigQuery Doesn’t Allow Grouping by Geography

GEOGRAPHY is not a groupable data type in BigQuery: geography values cannot appear in GROUP BY or DISTINCT, and they cannot be compared with =. ST_EQUALS is the supported way to compare them. A likely reason is that two geographies can describe the same shape with different vertex orders or starting points, so equality is a spatial test rather than a cheap value comparison. That is why GROUP BY h3id, geog fails with “Grouping by expressions of type GEOGRAPHY is not allowed”.

This limitation doesn’t mean you can’t work with geography in aggregations; you just need to use alternative approaches to achieve similar results.
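A small self-contained sketch of the restriction, using made-up inline rows. The cells CTE and its values are hypothetical and are not taken from the tables in the question. The rejected statement is left as a comment so the block runs as written:

```sql
-- Hypothetical inline data: two rows that share an h3id and a geography.
WITH cells AS (
  SELECT 'cell_a' AS h3id, ST_GEOGPOINT(0, 0) AS geog
  UNION ALL
  SELECT 'cell_a', ST_GEOGPOINT(0, 0)
)
-- Rejected: "Grouping by expressions of type GEOGRAPHY is not allowed"
-- SELECT h3id, geog, COUNT(*) AS n FROM cells GROUP BY h3id, geog

-- Accepted: group by the non-geography key only.
SELECT
    h3id,
    COUNT(*) AS n
FROM
    cells
GROUP BY
    h3id
```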


Solution 1: Using Window Functions (Original Approach)

Your original approach using window functions is actually quite suitable for this case:

sql
SELECT 
    h3s.h3id, 
    h3s.geog, 
    MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id))) 
    OVER (PARTITION BY h3s.h3id) AS min_distance
FROM 
    rgns_h3_geogs_clipped h3s
CROSS JOIN 
    hotspot_centers htsp

This approach:

  • Preserves all rows from your cross join
  • Adds a new column with the minimum distance for each h3id
  • Keeps the geography column intact
  • Avoids the grouping limitation entirely

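The window function calculates the minimum distance within each partition (defined by h3s.h3id) and attaches it to every row. Unlike GROUP BY, it does not collapse rows: the result still has one row per (cell, hotspot) pair, each repeating the same min_distance for its h3id. SELECT DISTINCT cannot remove those repeats here, because DISTINCT is also rejected for GEOGRAPHY columns.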
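To get exactly one row per h3id while staying with window functions, one option is to rank the pairs by distance and keep only the closest one with QUALIFY. This is a minimal sketch, assuming the same tables as in the question. The names pair_distances and distance are introduced here for illustration:

```sql
-- Compute the distance once per (cell, hotspot) pair.
WITH pair_distances AS (
  SELECT
      h3s.h3id,
      h3s.geog,
      ST_DISTANCE(
          `carto-os`.carto.H3_CENTER(htsp.h3id),
          `carto-os`.carto.H3_CENTER(h3s.h3id)
      ) AS distance
  FROM
      rgns_h3_geogs_clipped h3s
  CROSS JOIN
      hotspot_centers htsp
)
SELECT
    h3id,
    geog,
    distance AS min_distance
FROM
    pair_distances
WHERE
    TRUE  -- No-op filter, kept as a precaution in case QUALIFY must accompany WHERE, GROUP BY, or HAVING
QUALIFY
    -- Keep only the closest hotspot row for each h3id.
    ROW_NUMBER() OVER (PARTITION BY h3id ORDER BY distance) = 1
```

Because the geography rides along on the surviving row, it is never grouped or compared. If an h3id appears with several different geog values in the source table, only the one on the closest row is kept.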


Solution 2: Using a Subquery with GROUP BY

If you specifically need to use GROUP BY, you can separate the aggregation from the geography selection:

sql
WITH min_distances AS (
  SELECT 
      h3s.h3id, 
      MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id))) AS min_distance
  FROM 
      rgns_h3_geogs_clipped h3s
  CROSS JOIN 
      hotspot_centers htsp
  GROUP BY 
      h3s.h3id
)
SELECT 
    h3s.h3id, 
    h3s.geog,
    md.min_distance
FROM 
    rgns_h3_geogs_clipped h3s
JOIN 
    min_distances md ON h3s.h3id = md.h3id

This approach:

  1. First calculates the minimum distance for each h3id in a CTE
  2. Then joins this result back with the original table to include the geography column
  3. Avoids grouping by geography entirely

This method is particularly useful when you need to combine multiple aggregations or when your query logic becomes complex.
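As an illustration of combining several aggregations, the CTE can return more than one statistic per h3id before the join back. This is a sketch only: the 1000 meter threshold and the names pair_distances, distance_stats, avg_distance, and hotspots_within_1km are assumptions made for the example.

```sql
WITH pair_distances AS (
  -- One row per (cell, hotspot) pair, without the geography column.
  SELECT
      h3s.h3id,
      ST_DISTANCE(
          `carto-os`.carto.H3_CENTER(htsp.h3id),
          `carto-os`.carto.H3_CENTER(h3s.h3id)
      ) AS distance
  FROM
      rgns_h3_geogs_clipped h3s
  CROSS JOIN
      hotspot_centers htsp
),
distance_stats AS (
  -- Several aggregates at once; GROUP BY touches only the h3id.
  SELECT
      h3id,
      MIN(distance) AS min_distance,
      AVG(distance) AS avg_distance,
      COUNTIF(distance <= 1000) AS hotspots_within_1km  -- hypothetical threshold, in meters
  FROM
      pair_distances
  GROUP BY
      h3id
)
SELECT
    h3s.h3id,
    h3s.geog,
    ds.min_distance,
    ds.avg_distance,
    ds.hotspots_within_1km
FROM
    rgns_h3_geogs_clipped h3s
JOIN
    distance_stats ds ON h3s.h3id = ds.h3id
```

The join returns one output row per row of rgns_h3_geogs_clipped, so the result has one row per h3id only if that table holds each h3id once.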


Solution 3: Using ARRAY_AGG for Geography Preservation

If you want a single query with no join back, you can aggregate the geography itself. ARRAY_AGG accepts GEOGRAPHY values, so you can collect them per group and keep one:

sql
SELECT 
    h3s.h3id,
    ARRAY_AGG(h3s.geog LIMIT 1)[OFFSET(0)] AS geog,  -- Keeps one geography value per h3id
    MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id))) AS min_distance
FROM 
    rgns_h3_geogs_clipped h3s
CROSS JOIN 
    hotspot_centers htsp
GROUP BY 
    h3s.h3id

This approach:

  • Collects the geography values for each H3 ID into an array, capped at one element by LIMIT 1
  • Extracts that element using OFFSET(0)
  • Groups only by the H3 ID, not the geography
  • Preserves geography information without directly grouping by it

Note: ARRAY_AGG(DISTINCT h3s.geog) does not work here, because DISTINCT requires a groupable type and is rejected for GEOGRAPHY just as GROUP BY is. This solution also assumes that each h3id corresponds to a single geog value. If there are multiple geography values for the same H3 ID, it picks one arbitrarily, since the array order is undefined without an ORDER BY.
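ANY_VALUE is a more compact way to express the same idea: it returns one arbitrary value from each group and accepts GEOGRAPHY input, so no array is needed. A sketch under the same assumption that each h3id has a single geog:

```sql
SELECT
    h3s.h3id,
    ANY_VALUE(h3s.geog) AS geog,  -- One arbitrary geography per h3id; safe when each h3id has exactly one
    MIN(
        ST_DISTANCE(
            `carto-os`.carto.H3_CENTER(htsp.h3id),
            `carto-os`.carto.H3_CENTER(h3s.h3id)
        )
    ) AS min_distance
FROM
    rgns_h3_geogs_clipped h3s
CROSS JOIN
    hotspot_centers htsp
GROUP BY
    h3s.h3id
```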


Best Practices for Geospatial Aggregations in BigQuery

When working with geospatial data in BigQuery, consider these best practices:

  1. Use H3 IDs for grouping: Since H3 IDs are designed for hierarchical geospatial indexing, they’re ideal for grouping operations.

  2. Leverage window functions when appropriate: Window functions can often replace complex GROUP BY operations while preserving all original rows.

  3. Pre-aggregate when possible: If working with large datasets, consider pre-aggregating distances or other calculations in a separate table.

  4. Be cautious with CROSS JOINs: Cross joins can produce very large result sets. Consider using joins with conditions or more efficient spatial operations if possible.

  5. Choose appropriate geography aggregation functions:

    • ST_UNION_AGG: To merge multiple geography objects
    • ST_CENTROID_AGG: To get the centroid of multiple geography objects
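    • ST_CONVEXHULL(ST_UNION_AGG(geog)): To get the convex hull of multiple geography objects (BigQuery has no dedicated convex-hull aggregate, so the hull is taken over the merged geography)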
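The geography aggregates can be sketched on made-up inline rows. The cells CTE, region_id, and the point coordinates are hypothetical. As far as I know BigQuery offers no convex-hull aggregate, so the sketch wraps ST_UNION_AGG in the scalar ST_CONVEXHULL:

```sql
-- Hypothetical inline data: three points in two regions.
WITH cells AS (
  SELECT 'r1' AS region_id, ST_GEOGPOINT(0, 0) AS geog
  UNION ALL
  SELECT 'r1', ST_GEOGPOINT(1, 1)
  UNION ALL
  SELECT 'r2', ST_GEOGPOINT(10, 10)
)
SELECT
    region_id,
    ST_UNION_AGG(geog) AS merged_geog,            -- all geographies in the group merged into one
    ST_CENTROID_AGG(geog) AS centroid,            -- centroid of the group's geographies
    ST_CONVEXHULL(ST_UNION_AGG(geog)) AS hull     -- convex hull of the merged geography
FROM
    cells
GROUP BY
    region_id
```

Each of these consumes the geography column inside an aggregate, so GROUP BY only ever sees region_id.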

Conclusion

To group by geography in BigQuery while preserving the geography column:

  1. Your window function approach is valid and keeps the geography column, but it returns one row per (cell, hotspot) pair rather than one row per H3 ID
  2. For GROUP BY requirements, use a subquery or CTE to calculate aggregations first, then join back with the original table
  3. Avoid directly grouping by GEOGRAPHY columns as BigQuery doesn’t support this operation
  4. Consider using H3-specific functions that might be more efficient than generic distance calculations

For your specific case, the window function approach is working correctly and preserving the geography column while calculating the minimum distance for each H3 ID. If you need the results in a different format (e.g., one row per H3 ID), then the CTE approach would be more appropriate.