PolarDB IMCI: Optimize SELECT * on Wide Tables

Mitigate read amplification for SELECT * on wide tables (500+ columns) in PolarDB IMCI. Column pruning, optimizer settings, and hybrid execution tips.

How can SELECT * queries on wide tables (500+ columns) in PolarDB IMCI be optimized to mitigate massive read amplification, given that columnstore filtering is fast but fetching all columns is inefficient?

PolarDB IMCI’s columnstore filtering is fast, but SELECT * on a wide table (500+ columns) causes massive read amplification because the engine must read the blocks of every column. The quickest, highest-impact fix is column pruning: explicitly list the columns you need. After that, apply IMCI-specific options (runtime filters, hybrid execution hints, and optimizer switches), schema changes (vertical partitioning or materialized summaries), and scaling (read-only column-store nodes / MPP) to reduce I/O and latency.


PolarDB IMCI: Why SELECT * on Wide Tables Causes Read Amplification

IMCI stores data by column, so a query that reads just a few columns touches only a small fraction of disk and memory. SELECT * forces the engine to read every column’s compressed blocks and often to materialize full rows; that is the read amplification seen on tables with 500+ columns. For example, a query that needs 4 of 500 similarly sized columns reads under 1% of the column data that SELECT * reads. In other words, columnstore filtering with column pruning is fast, but SELECT * defeats pruning and multiplies I/O.

PolarDB’s IMCI architecture tries to mitigate this (hybrid execution, RID locators, vectorized scanning), and the optimizer can sometimes choose a more efficient path. Still, the vendor docs call out the plain fix: explicitly select only the columns you need rather than using SELECT * (see PolarDB guidance on IMCIs and column pruning). For deeper architecture detail see the PolarDB-IMCI paper for how hybrid execution and index strategies behave under wide-table workloads: https://arxiv.org/pdf/2305.08468 and the PolarDB docs: https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/overview-29.


How to Optimize SELECT * Queries in PolarDB IMCI

Below are practical techniques, ordered roughly by impact and ease. Start at the top (column pruning) and only move to schema or cluster-level changes if queries still read too much data.

Column pruning (explicit projection)

  • Replace SELECT * with an explicit column list wherever production traffic or frequent analytics are involved. This lets the IMCI columnstore read only the required column blocks and dramatically reduces I/O.
  • Example:
sql
-- Bad: triggers full-column reads on a 500+ column table
SELECT * FROM wide_table WHERE event_date >= '2025-01-01' LIMIT 1000;

-- Good: reads far fewer column blocks
SELECT id, event_date, metric_1, metric_2
FROM wide_table
WHERE event_date >= '2025-01-01'
LIMIT 1000;
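
Typing out an explicit list for a 500+ column table is tedious, so it can help to generate the full list from the catalog and then trim it down to what the query needs. A minimal sketch, assuming the table is named wide_table (as in the example above) and lives in the current schema. information_schema.COLUMNS and GROUP_CONCAT are standard MySQL; whether your PolarDB version behaves identically should be verified:

```sql
-- Raise the GROUP_CONCAT limit for this session only; the default (1024 bytes)
-- would truncate a list of 500+ column names.
SET SESSION group_concat_max_len = 1000000;

-- Build a comma-separated, backtick-quoted column list in table order.
SELECT GROUP_CONCAT(CONCAT('`', COLUMN_NAME, '`')
                    ORDER BY ORDINAL_POSITION
                    SEPARATOR ', ') AS column_list
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'wide_table';  -- assumed table name
```

Paste the result into the query and delete the columns you do not need; the goal is a short list, not a spelled-out SELECT *.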

Vertical partitioning / narrow tables

  • Split the wide table into a small “hot” table (core columns used by most queries) and one or more “wide” tables holding infrequently used columns. Join only when needed.
  • Pros: huge reduction in per-query I/O for common access patterns. Cons: added joins, schema complexity, and more work on updates.
  • Typical pattern: wide_table_core(pk, cols_used_often), wide_table_wide(pk, rarely_used_col_1, …).
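
The split described above might look roughly like this. It is a sketch only: the column names and types are placeholders, both tables are assumed to share the primary key id, and the columnar index definitions are left out because they depend on your cluster setup.

```sql
-- Hypothetical "hot" table: the few columns most queries use.
CREATE TABLE wide_table_core (
    id         BIGINT NOT NULL PRIMARY KEY,
    event_date DATE   NOT NULL,
    metric_1   DOUBLE,
    metric_2   DOUBLE
);

-- Hypothetical "wide" table: infrequently used columns, same key.
CREATE TABLE wide_table_wide (
    id                BIGINT NOT NULL PRIMARY KEY,
    rarely_used_col_1 VARCHAR(255),
    rarely_used_col_2 VARCHAR(255)
    -- remaining rarely used columns would follow here
);

-- Common path: touches only the narrow table.
SELECT id, event_date, metric_1
FROM wide_table_core
WHERE event_date >= '2025-01-01';

-- Rare path: join in the wide columns only when they are needed.
SELECT c.id, c.event_date, w.rarely_used_col_1
FROM wide_table_core AS c
JOIN wide_table_wide AS w ON w.id = c.id
WHERE c.event_date >= '2025-01-01';
```

Writes now have to touch both tables in one transaction, which is the update cost mentioned in the cons above.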

Materialized views and pre-aggregation

Predicate pushdown & runtime filters

Two‑phase fetch / covering-index pattern

  • Two-phase pattern: use columnstore to find matching primary keys (small set), then fetch the full rows from row-store (or via RID lookup) only for that small set. This reduces read amplification when SELECT * is semantically needed but the match set is small.
  • The IMCI/hybrid path supports this approach in certain cost regimes (see the IMCI paper): https://arxiv.org/pdf/2305.08468.
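
The two-phase idea can also be written out by hand in SQL. A minimal sketch, assuming id is the primary key of wide_table and reusing the column names from the examples in this answer. Whether the optimizer actually runs the inner query on the column store and the outer lookup on the row store depends on its cost model, so confirm with EXPLAIN:

```sql
-- Phase 1 (derived table k): read only the key plus the columns needed for
-- filtering and ordering, so a columnar scan stays narrow.
-- Phase 2 (join back): fetch full rows by primary key for the small key set.
SELECT w.*
FROM (
    SELECT id
    FROM wide_table
    WHERE analytic_column > 1000
    ORDER BY timestamp_col
    LIMIT 1000
) AS k
JOIN wide_table AS w ON w.id = k.id
ORDER BY w.timestamp_col;  -- the join does not preserve the inner ordering
```

This only pays off when the key set is small; if phase 1 returns a large share of the table, the per-row lookups can cost more than a single wide scan.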

Hybrid execution and optimizer-driven plans

  • IMCI can pick column-based or row-based (hybrid) execution depending on cost. You can also influence behavior with optimizer switches and hints (details below). Example community hint that’s commonly used:
sql
SELECT /*+ SET_VAR(imci_optimizer_switch='force_hybrid_index_search=ON') */
 *
FROM wide_table
WHERE analytic_column > 1000
ORDER BY timestamp_col
LIMIT 1000;

(Example pattern referenced in community threads; test on your workload before broad use.)

Scale-out and read-only column-store nodes (MPP)

Query-level tactics: sampling, LIMIT, ORDER BY caution

  • If SELECT * is used for sampling or ad-hoc inspection, apply LIMIT and avoid expensive ORDER BYs that force full materialization. Prefer TABLESAMPLE (if available) or SELECT … LIMIT n with no ordering for cheap samples.
  • For pagination avoid queries that require full table materialization; paginate on indexed columns.
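
One way to paginate on an indexed column is keyset (seek) pagination. The sketch below assumes id is the primary key and uses placeholder column names and a placeholder last-seen value:

```sql
-- First page: ordered by the primary key, explicit projection.
SELECT id, event_date, metric_1
FROM wide_table
ORDER BY id
LIMIT 100;

-- Next page: pass the last id returned by the previous page
-- (12345 is a placeholder), instead of using a growing OFFSET.
SELECT id, event_date, metric_1
FROM wide_table
WHERE id > 12345
ORDER BY id
LIMIT 100;
```

Unlike LIMIT with a large OFFSET, each page only has to locate the rows after the last key, so the cost per page stays roughly constant.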

IMCI Optimizer Settings and Hints

PolarDB exposes IMCI-specific optimizer flags that can help the engine choose a lower-I/O plan for wide tables. Use them carefully and test impact.

  • Example global setting from the docs; judging by the flag names, it turns on IMCI cardinality estimation and join reordering:
sql
SET GLOBAL loose_imci_optimizer_switch =
 'use_imci_card_est=ON,use_imci_join_reorder=ON';
  • Per-query hint (example of forcing hybrid index search):
sql
SELECT /*+ SET_VAR(imci_optimizer_switch='force_hybrid_index_search=ON') */ ...

Diagnostics:

Caveat: optimizer flags can improve one query and regress another. Toggle per-session or per-query and keep regression tests.


When SELECT * Is Unavoidable: Practical Mitigations

Sometimes you must SELECT * (debugging, full ETL exports, third-party tools). Use mitigations to limit impact:

  • Run such queries on a read-only row-store or hybrid node so you don’t trigger full column-store read amplification on the analytic nodes (PolarDB recommends using a read-only row-store/hybrid node for debugging SELECT *): https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/overview-29.
  • Create a temporary snapshot: run CREATE TABLE snapshot AS SELECT * … LIMIT n (or export a small dataset) and query the snapshot.
  • Schedule heavy SELECT * work during off-peak windows or on dedicated read replicas.
  • If SELECT * is used by an internal tool, add a small caching layer or cache the results in a narrow table after the first read.
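
The snapshot approach from the list above could look like this. It is a sketch with an assumed table name (wide_table_snapshot), filter, and row count; adjust all three to your data:

```sql
-- Copy a bounded slice of the wide table once.
-- Note: CREATE TABLE ... AS SELECT copies data and column definitions,
-- but not indexes or the primary key.
CREATE TABLE wide_table_snapshot AS
SELECT *
FROM wide_table
WHERE event_date >= '2025-01-01'
LIMIT 10000;

-- Inspect the small copy as often as needed.
SELECT * FROM wide_table_snapshot LIMIT 50;

-- Clean up when finished.
DROP TABLE wide_table_snapshot;
```

The wide read happens once, at snapshot creation, rather than on every ad-hoc query.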

Monitoring, Testing and Best Practices

Measure before and after every change. Practical checklist:

  • Find high-impact SELECT * queries via slow query logs, audit logs or monitoring dashboards.
  • Replace the top offenders with explicit projection first; the biggest wins typically come from a small share of queries (roughly the top 10% by cost).
  • Use EXPLAIN to confirm whether plans use columnstore pruning, hybrid execution, or full scans.
  • Monitor I/O (bytes read), CPU, and wall-time, plus IMCI pruner stats after changes.
  • Test changes against production-like data volumes — pruning and cardinality effects often only show up at scale.
  • Keep optimizer switch and hint changes scoped (per-session or per-query) until validated.
  • Document schema changes (vertical partitioning), and update application code/ORMs so new tables and column lists are used consistently.
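
As a starting point for the audit, the standard MySQL Performance Schema digest table can rank SELECT * statements by total time. This is a sketch that assumes the Performance Schema is enabled on your cluster (it may be off by default) and that the digest text begins with the SELECT itself; slow query logs or the console's SQL diagnostics are alternatives:

```sql
-- Rank normalized SELECT * statements by total execution time.
SELECT DIGEST_TEXT,
       COUNT_STAR                AS executions,
       SUM_TIMER_WAIT / 1e12     AS total_seconds,  -- timer values are in picoseconds
       SUM_ROWS_EXAMINED         AS rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT LIKE 'SELECT * FROM%'
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;

-- Compare plans before and after replacing SELECT * with a projection.
EXPLAIN SELECT * FROM wide_table WHERE event_date >= '2025-01-01' LIMIT 1000;

EXPLAIN SELECT id, event_date, metric_1, metric_2
FROM wide_table
WHERE event_date >= '2025-01-01'
LIMIT 1000;
```

The exact EXPLAIN output format for column-store plans is specific to PolarDB, so treat the plan comparison as a check on which engine and access path were chosen rather than as a precise cost figure.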

A simple phased plan:

  1. Audit queries and rank by cost.
  2. Replace SELECT * in high-volume queries with explicit column lists.
  3. Add covering summaries/materialized views for repeated patterns.
  4. If still problematic, evaluate vertical partitioning or force hybrid execution for the specific queries.
  5. If scans are legitimate analytics, scale read nodes (MPP) and schedule heavy workloads on dedicated nodes.


Conclusion

To mitigate massive read amplification from SELECT * on wide tables in PolarDB IMCI, start with column pruning: explicitly select only the columns you need. Then move to schema-level fixes (vertical partitioning, summary tables), IMCI optimizations (runtime filters, hybrid execution, loose_imci_optimizer_switch), and scaling (read-only column-store nodes / MPP) as required. Measure each change (EXPLAIN, IMCI pruner stats, I/O metrics) to confirm the real wins: less data read, lower latency, and more predictable analytic performance.
