How to group by geography in BigQuery while preserving the geography column?
I have the following code:
SELECT
  h3s.h3id,
  h3s.geog,
  MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)))
    OVER (PARTITION BY h3s.h3id)
FROM
  rgns_h3_geogs_clipped h3s
CROSS JOIN
  hotspot_centers htsp
I would like to group by h3s.h3id while keeping the geography column. When I try to remove the OVER clause and use GROUP BY h3id, geog instead, I get the error:
“Grouping by expressions of type GEOGRAPHY is not allowed”
How can I perform the grouping operation while retaining the geography column in BigQuery?
Brief Answer
In BigQuery, you cannot directly group by geography columns because the GEOGRAPHY data type is not supported in GROUP BY clauses. To solve this while preserving the geography information, you can use the H3 identifier for grouping and then retrieve the geography through a join back to the original table using the H3 ID as the key.
Contents
- Why BigQuery Doesn’t Allow GROUP BY on GEOGRAPHY
- Solution 1: Group by H3 ID and Join Back
- Solution 2: Use Window Functions
- Solution 3: Convert Geography to String
- Performance Considerations
- Best Practices for H3 Geographic Analysis
Why BigQuery Doesn’t Allow GROUP BY on GEOGRAPHY
BigQuery restricts which data types can be used as grouping keys, and GEOGRAPHY is not one of them. GEOGRAPHY values also cannot be compared with the = operator; spatial equality is tested with ST_EQUALS instead. The likely reasons for this limitation are:
- Grouping needs a cheap, exact equality test, but two geographies can be spatially equal while being stored with a different vertex order or starting point
- Testing spatial equality is computationally expensive compared with comparing scalar values
- Hash-based grouping needs a canonical key per value, which a geography does not readily provide
When you attempt GROUP BY h3id, geog, BigQuery rejects the query because geog is a GEOGRAPHY column, even though the H3 ID already identifies the cell.
Solution 1: Group by H3 ID and Join Back
A straightforward solution is to group by the H3 ID alone and then join back to the original table to retrieve the geography:
WITH grouped_data AS (
  SELECT
    h3s.h3id,
    MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)))
      AS min_distance
  FROM
    rgns_h3_geogs_clipped h3s
  CROSS JOIN
    hotspot_centers htsp
  GROUP BY
    h3s.h3id
)
SELECT
  gd.h3id,
  h3s.geog,
  gd.min_distance
FROM
  grouped_data gd
JOIN
  rgns_h3_geogs_clipped h3s ON gd.h3id = h3s.h3id
This approach:
- Groups by the H3 ID (which is a string/integer)
- Calculates the minimum distance for each H3 cell
- Joins back to the original table to retrieve the geography column
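If each h3id maps to exactly one geog, the join can be avoided altogether. The following is a minimal sketch, assuming that one-to-one relationship holds in rgns_h3_geogs_clipped (the question does not confirm it). ANY_VALUE returns an arbitrary value from each group, and because it is an aggregate function, the GEOGRAPHY column never has to appear in GROUP BY.

```sql
-- Sketch: group on the scalar H3 ID and pick the geography with ANY_VALUE.
-- Assumption: every row with the same h3id carries the same geog.
SELECT
  h3s.h3id,
  ANY_VALUE(h3s.geog) AS geog,  -- aggregate, so geog stays out of GROUP BY
  MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)))
    AS min_distance
FROM
  rgns_h3_geogs_clipped h3s
CROSS JOIN
  hotspot_centers htsp
GROUP BY
  h3s.h3id
```

If a cell ID can carry several different geographies, ANY_VALUE silently keeps only one of them, so the join in Solution 1 is the safer choice in that case.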
Solution 2: Use Window Functions
If you prefer to keep the window function from your original query, you can add a QUALIFY clause to reduce the result to one row per H3 ID:
SELECT
  h3s.h3id,
  h3s.geog,
  MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)))
    OVER (PARTITION BY h3s.h3id) AS min_distance
FROM
  rgns_h3_geogs_clipped h3s
CROSS JOIN
  hotspot_centers htsp
QUALIFY
  -- GEOGRAPHY cannot be used in ORDER BY either, so order by a scalar column
  ROW_NUMBER() OVER (PARTITION BY h3s.h3id ORDER BY htsp.h3id) = 1
This approach:
- Uses the window function as in your original query
- Uses QUALIFY to return only one row per H3 ID
- Keeps the geography column because no GROUP BY is involved; QUALIFY filters rows only after the window functions have been evaluated
Solution 3: Convert Geography to String
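If you need to group on the geometry itself rather than on the H3 ID, you can serialize the geography to a string with ST_ASTEXT, group on that string, and rebuild the geography afterwards with ST_GEOGFROMTEXT:
WITH pairs AS (
  SELECT
    h3s.h3id,
    -- WKT text is a groupable STRING, unlike GEOGRAPHY
    ST_ASTEXT(h3s.geog) AS geog_wkt,
    ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)) AS distance
  FROM
    rgns_h3_geogs_clipped h3s
  CROSS JOIN
    hotspot_centers htsp
)
SELECT
  h3id,
  ST_GEOGFROMTEXT(geog_wkt) AS geog,
  MIN(distance) AS min_distance
FROM
  pairs
GROUP BY
  h3id,
  geog_wkt
However, this approach has limitations:
- Grouping is by exact text match, so geographies that are spatially equal but serialized differently end up in separate groups
- Serializing and re-parsing every geography adds cost, especially for polygons with many vertices
- The WKT string is carried through every row of the cross join, which increases the amount of data processed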
Performance Considerations
When working with H3 and geographic data in BigQuery, consider these performance aspects:
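- Clustering: H3 IDs are scalar keys, so tables can be clustered on them, which helps joins and filters on the cell ID
- Join Strategy: Solution 1 aggregates before joining, so the geography column is typically read once per cell instead of being carried through every row of the cross join
- Data Volume: For large datasets, consider pre-aggregating or using materialized views
- Cross Join Impact: The CROSS JOIN produces one row per (cell, hotspot) pair, so the intermediate result grows with the product of the two table sizes; filter both inputs as much as possible before joining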
Here’s a version that keeps the geography column out of the cross join, deduplicates the cell IDs first, and aggregates before joining back:
WITH cell_ids AS (
  -- Only the scalar cell ID enters the cross join
  SELECT DISTINCT
    h3id
  FROM
    rgns_h3_geogs_clipped
),
min_distances AS (
  SELECT
    c.h3id,
    MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(c.h3id)))
      AS min_distance
  FROM
    cell_ids c
  CROSS JOIN
    hotspot_centers htsp
  GROUP BY
    c.h3id
)
SELECT
  h3s.h3id,
  h3s.geog,
  m.min_distance
FROM
  min_distances m
JOIN
  rgns_h3_geogs_clipped h3s ON m.h3id = h3s.h3id
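The cross-join concern can also be addressed by bounding the search radius. The following is a sketch only: the 50000 meter radius is a hypothetical value chosen for illustration, and the query assumes one geog per h3id so that ANY_VALUE is safe. Because this is an inner join, cells with no hotspot inside the radius drop out of the result, which changes the semantics compared with the unrestricted cross join.

```sql
-- Sketch: replace the CROSS JOIN with a distance-bounded spatial join.
-- Assumptions: 50000 (meters) is an illustrative radius, not a recommendation;
-- each h3id has a single geog.
SELECT
  h3s.h3id,
  ANY_VALUE(h3s.geog) AS geog,
  MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)))
    AS min_distance
FROM
  rgns_h3_geogs_clipped h3s
JOIN
  hotspot_centers htsp
  -- Only pairs whose centers lie within the radius are kept
  ON ST_DWITHIN(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id), 50000)
GROUP BY
  h3s.h3id
```

Use a LEFT JOIN instead if cells without a nearby hotspot should still appear, with a NULL min_distance.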
Best Practices for H3 Geographic Analysis
When working with H3 and geography data in BigQuery:
- Use H3 IDs for grouping: H3 IDs are scalar (string or integer) identifiers for cells, so they’re ideal grouping keys
- Keep GEOGRAPHY out of GROUP BY: Group on the H3 ID and retrieve the geography separately, through a join or an aggregate
- Leverage window functions: For preserving rows while performing aggregations
- Consider pre-aggregation: For repeated analysis, create pre-aggregated views
- Use appropriate H3 resolutions: Higher resolutions (smaller cells) provide more precision but increase data volume
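The pre-aggregation advice could look like the sketch below. The dataset name my_dataset and the table name h3_min_hotspot_distance are hypothetical, and the source tables are assumed to be resolvable as written in the question. The result table stores only the scalar key and the aggregate, and the geography is joined back when it is needed.

```sql
-- Sketch: persist the per-cell minimum distance, clustered on the H3 ID.
-- my_dataset.h3_min_hotspot_distance is a hypothetical name.
CREATE OR REPLACE TABLE my_dataset.h3_min_hotspot_distance
CLUSTER BY h3id
AS
SELECT
  h3s.h3id,
  MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)))
    AS min_distance
FROM
  rgns_h3_geogs_clipped h3s
CROSS JOIN
  hotspot_centers htsp
GROUP BY
  h3s.h3id;

-- Later queries join the geography back on the scalar key.
SELECT
  h3s.h3id,
  h3s.geog,
  d.min_distance
FROM
  my_dataset.h3_min_hotspot_distance d
JOIN
  rgns_h3_geogs_clipped h3s ON d.h3id = h3s.h3id;
```

The table has to be rebuilt whenever the cells or the hotspots change, so this fits analyses that are repeated against fairly stable inputs.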
By following these approaches, you can effectively group by geography in BigQuery while preserving the geography column information for your analysis.