GLM 4.5 Air

Complete Guide: Group by Geography in BigQuery

Learn how to group by geography in BigQuery while preserving geography columns. Discover multiple solutions including H3 ID grouping, window functions, and performance optimization techniques.

Question

How to group by geography in BigQuery while preserving the geography column?

I have the following code:

sql
SELECT 
    h3s.h3id, h3s.geog, 
    MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id))) 
    OVER (PARTITION BY h3s.h3id) 
FROM 
    rgns_h3_geogs_clipped h3s
CROSS JOIN 
    hotspot_centers htsp

I would like to group by h3s.h3id while keeping the geography column. When I try to remove the OVER clause and use GROUP BY h3id, geog instead, I get the error:

“Grouping by expressions of type GEOGRAPHY is not allowed”

How can I perform the grouping operation while retaining the geography column in BigQuery?

Brief Answer

In BigQuery, you cannot directly group by geography columns because the GEOGRAPHY data type is not supported in GROUP BY clauses. To solve this while preserving the geography information, you can use the H3 identifier for grouping and then retrieve the geography through a join back to the original table using the H3 ID as the key.

Why BigQuery Doesn’t Allow GROUP BY on GEOGRAPHY

BigQuery restricts which data types can be used as grouping keys, and GEOGRAPHY is not one of them. Likely reasons for this limitation include:

  • Geographic values are complex objects (points, lines, polygons, collections) rather than simple scalars
  • Grouping requires an equality test, and BigQuery does not support direct equality comparison of GEOGRAPHY values (ST_EQUALS is the spatial equivalent)
  • Grouping is typically hash-based, and spatially equal geographies can have different underlying representations

When you attempt GROUP BY h3id, geog, BigQuery rejects the query because geog is a GEOGRAPHY column, even though each geography is already identified by its H3 ID.
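
Because the H3 ID stands in for the geography as the grouping key, it may be worth confirming first that each h3id really maps to a single row. The following is a minimal sketch of such a check against the table from the question; if it returns rows, grouping by h3id alone would merge distinct geographies:

sql
-- Lists h3id values that appear on more than one row
SELECT
    h3id,
    COUNT(*) AS row_count
FROM
    rgns_h3_geogs_clipped
GROUP BY
    h3id
HAVING
    COUNT(*) > 1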

Solution 1: Group by H3 ID and Join Back

A straightforward solution is to group by the H3 ID, which is a groupable scalar, and then join back to retrieve the geography information:

sql
WITH grouped_data AS (
  SELECT 
      h3s.h3id,
      MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id))) 
      AS min_distance
  FROM 
      rgns_h3_geogs_clipped h3s
  CROSS JOIN 
      hotspot_centers htsp
  GROUP BY 
      h3s.h3id
)
SELECT 
    gd.h3id,
    h3s.geog,
    gd.min_distance
FROM 
    grouped_data gd
JOIN 
    rgns_h3_geogs_clipped h3s ON gd.h3id = h3s.h3id

This approach:

  1. Groups by the H3 ID (which is a string/integer)
  2. Calculates the minimum distance for each H3 cell
  3. Joins back to the original table to retrieve the geography column
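
A join-free variant of the same idea can be sketched with ANY_VALUE, an aggregate function that returns one value from each group. This assumes every row sharing an h3id carries the same geography; if that does not hold, the geography returned is arbitrary:

sql
SELECT
    h3s.h3id,
    -- ANY_VALUE picks one geog per group (assumed identical within a group)
    ANY_VALUE(h3s.geog) AS geog,
    MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)))
    AS min_distance
FROM
    rgns_h3_geogs_clipped h3s
CROSS JOIN
    hotspot_centers htsp
GROUP BY
    h3s.h3id

Since geog is wrapped in an aggregate, it no longer needs to appear in GROUP BY, which avoids the error from the question.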

Solution 2: Use Window Functions

If you prefer to keep the window function from your original query, you can compute the minimum per partition and then use QUALIFY to keep a single row per H3 ID:

sql
SELECT 
    h3s.h3id,
    h3s.geog,
    MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id))) 
    OVER (PARTITION BY h3s.h3id) AS min_distance
FROM 
    rgns_h3_geogs_clipped h3s
CROSS JOIN 
    hotspot_centers htsp
QUALIFY 
    -- GEOGRAPHY is not orderable, so order by a scalar column instead of geog
    ROW_NUMBER() OVER (PARTITION BY h3s.h3id ORDER BY htsp.h3id) = 1

This approach:

  • Uses the window function as in your original query
  • Uses QUALIFY to return only one row per H3 ID
  • Preserves the geography column by filtering after window operations

Solution 3: Convert Geography to String

If you need to group by geographic characteristics, you can convert the geography to a string representation:

sql
SELECT 
    h3id,
    -- Parse the WKT grouping key back into a GEOGRAPHY
    ST_GEOGFROMTEXT(geog_wkt) AS geog,
    MIN(distance) AS min_distance
FROM (
  SELECT 
      h3s.h3id,
      -- WKT text is a STRING, which is groupable
      ST_ASTEXT(h3s.geog) AS geog_wkt,
      ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)) AS distance
  FROM 
      rgns_h3_geogs_clipped h3s
  CROSS JOIN 
      hotspot_centers htsp
)
GROUP BY 
    h3id,
    geog_wkt

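However, this approach has limitations:

  • Geographies that are spatially equal but serialized differently (for example, with a different starting vertex) end up in separate groups
  • Converting to text and parsing back adds overhead, especially for large polygons
  • The H3 ID already identifies each cell, so the extra grouping key is redundant whenever h3id is unique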

Performance Considerations

When working with H3 and geographic data in BigQuery, consider these performance aspects:

  1. Clustering: BigQuery has no conventional indexes, but clustering tables on the H3 ID column can speed up joins and filters on it
  2. Join Strategy: Solution 1 (join back) groups and joins on a scalar key, which is typically cheaper than grouping on a serialized geography
  3. Data Volume: For large datasets, consider pre-aggregating or using materialized views
  4. Cross Join Impact: The CROSS JOIN in your query produces one row for every combination of H3 cell and hotspot, so ensure you’re filtering appropriately

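Here’s a version that computes the distances in a CTE and then aggregates, taking the geography with ANY_VALUE because geog cannot appear in GROUP BY (this assumes each h3id maps to a single geography):

sql
WITH hotspots AS (
  SELECT 
      -- Qualified, since both tables have an h3id column
      h3s.h3id,
      ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)) AS distance
  FROM 
      rgns_h3_geogs_clipped h3s
  CROSS JOIN 
      hotspot_centers htsp
)
SELECT 
    h3s.h3id,
    ANY_VALUE(h3s.geog) AS geog,
    MIN(h.distance) AS min_distance
FROM 
    hotspots h
JOIN 
    rgns_h3_geogs_clipped h3s ON h.h3id = h3s.h3id
GROUP BY 
    h3s.h3id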
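
For the pre-aggregation point in the list above, one possible sketch stores the per-cell minimum once as a table clustered on h3id, so later queries can join to it instead of repeating the cross join. The name my_dataset.h3_min_distance is hypothetical; substitute a dataset and table of your own:

sql
-- my_dataset.h3_min_distance is a placeholder name, not part of the original schema
CREATE OR REPLACE TABLE my_dataset.h3_min_distance
CLUSTER BY h3id AS
SELECT
    h3s.h3id,
    MIN(ST_DISTANCE(`carto-os`.carto.H3_CENTER(htsp.h3id), `carto-os`.carto.H3_CENTER(h3s.h3id)))
    AS min_distance
FROM
    rgns_h3_geogs_clipped h3s
CROSS JOIN
    hotspot_centers htsp
GROUP BY
    h3s.h3id

The geography is deliberately left out of this table; it can be retrieved when needed by joining back to rgns_h3_geogs_clipped on h3id, as in Solution 1.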

Best Practices for H3 Geographic Analysis

When working with H3 and geography data in BigQuery:

  1. Use H3 IDs for grouping: Since H3 IDs are string/integer representations of geography, they’re ideal for grouping operations
  2. Keep GEOGRAPHY out of GROUP BY: Group on H3 IDs and recover the geography afterwards whenever possible
  3. Leverage window functions: For preserving rows while performing aggregations
  4. Consider pre-aggregation: For repeated analysis, create pre-aggregated views
  5. Use appropriate H3 resolutions: Higher resolutions (smaller cells) provide more precision but increase data volume
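
To illustrate the resolution trade-off, the sketch below rolls cells up to a coarser parent cell before grouping, using H3_TOPARENT from the CARTO Analytics Toolbox. The target resolution 5 is an assumption for illustration and must be coarser than (numerically lower than) the resolution of the stored cells:

sql
SELECT
    -- 5 is an assumed target resolution; adjust to your data
    `carto-os`.carto.H3_TOPARENT(h3id, 5) AS parent_h3id,
    COUNT(*) AS cell_count
FROM
    rgns_h3_geogs_clipped
GROUP BY
    parent_h3id

Fewer, larger cells reduce the number of groups and the size of any cross join, at the cost of spatial precision.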

By following these approaches, you can effectively group by geography in BigQuery while preserving the geography column information for your analysis.