PolarDB IMCI: Optimize SELECT * on Wide Tables

12/31/2025, 11:53 AM

How can SELECT * queries on wide tables (500+ columns) in PolarDB IMCI be optimized to mitigate massive read amplification, given that columnstore filtering is fast but fetching all columns is inefficient?

PolarDB IMCI’s columnstore filtering is fast, but SELECT * queries on wide tables (500+ columns) cause massive read amplification because the engine must read blocks from every column. The quickest, highest-impact fix is column pruning: explicitly list the columns you need. After that, apply IMCI-specific options (runtime filters, hybrid execution hints, and optimizer switches), schema changes (vertical partitioning or materialized summaries), and scaling (read-only column-store nodes / MPP) to reduce I/O and latency.

PolarDB IMCI: Why SELECT * on Wide Tables Causes Read Amplification
How to Optimize SELECT * Queries in PolarDB IMCI
Column pruning (explicit projection)
Vertical partitioning / narrow tables
Materialized views and pre-aggregation
Predicate pushdown & runtime filters
Two‑phase fetch / covering-index pattern
Hybrid execution and optimizer-driven plans
Scale-out and read-only column-store nodes (MPP)
Query-level tactics: sampling, LIMIT, ORDER BY caution
IMCI Optimizer Settings and Hints
When SELECT * Is Unavoidable: Practical Mitigations
Monitoring, Testing and Best Practices
Conclusion

PolarDB IMCI: Why SELECT * on Wide Tables Causes Read Amplification

IMCI stores data by column, so a query that reads only a few columns touches a small fraction of disk and memory. SELECT *, however, forces the engine to read every column’s compressed blocks and often to materialize full rows. That is the read amplification seen on tables with 500+ columns. In other words, column pruning is what makes columnstore scans cheap, and SELECT * disables it and multiplies I/O.

PolarDB’s IMCI architecture tries to mitigate this (hybrid execution, RID locators, vectorized scanning), and the optimizer can sometimes choose a more efficient path. Still, the vendor docs call out the plain fix: explicitly select only the columns you need rather than using SELECT * (see PolarDB guidance on IMCIs and column pruning). For deeper architecture detail see the PolarDB-IMCI paper for how hybrid execution and index strategies behave under wide-table workloads: https://arxiv.org/pdf/2305.08468 and the PolarDB docs: https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/overview-29.

How to Optimize SELECT * Queries in PolarDB IMCI

Below are practical techniques, ordered roughly by impact and ease. Start at the top (column pruning) and only move to schema or cluster-level changes if queries still read too much data.

Column pruning (explicit projection)

Replace SELECT * with an explicit column list wherever production traffic or frequent analytics are involved. This lets the IMCI columnstore read only the required column blocks and dramatically reduces I/O.
Example:

```sql
-- Bad: triggers full-column reads on a 500+ column table
SELECT * FROM wide_table WHERE event_date >= '2025-01-01' LIMIT 1000;

-- Good: reads far fewer column blocks
SELECT id, event_date, metric_1, metric_2
FROM wide_table
WHERE event_date >= '2025-01-01'
LIMIT 1000;
```

Apply this at the application/ORM level (projection pushdown) so queries generated by your stack never default to SELECT *.
PolarDB docs explicitly recommend listing only needed columns to avoid read amplification: https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/overview-29.
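
When a tool cannot be stopped from issuing SELECT *, one possible workaround is to point it at a narrow view instead of the base table. The sketch below assumes the view name `wide_table_core_v` (hypothetical) and reuses the column names from the example above; it also assumes the optimizer merges the view into the outer query so that only the listed columns are scanned, which should be confirmed with EXPLAIN.

```sql
-- Hypothetical view exposing only the frequently used columns.
CREATE VIEW wide_table_core_v AS
SELECT id, event_date, metric_1, metric_2
FROM wide_table;

-- SELECT * against the view expands to the four listed columns only.
SELECT *
FROM wide_table_core_v
WHERE event_date >= '2025-01-01'
LIMIT 1000;
```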

Vertical partitioning / narrow tables

Split the wide table into a small “hot” table (core columns used by most queries) and one or more “wide” tables holding infrequently used columns. Join only when needed.
Pros: huge reduction in per-query I/O for common access patterns. Cons: added joins, schema complexity, and more work on updates.
Typical pattern: wide_table_core(pk, cols_used_often), wide_table_wide(pk, rarely_used_col_1, …).
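
A minimal sketch of that split, assuming illustrative column names and types (only `pk` and the table names come from the pattern above). Columnstore index definitions are omitted here and would need to be added according to the PolarDB docs.

```sql
-- Hot table: the few columns most queries need.
CREATE TABLE wide_table_core (
  pk         BIGINT NOT NULL PRIMARY KEY,
  event_date DATE   NOT NULL,
  metric_1   DOUBLE,
  metric_2   DOUBLE
);

-- Cold table: rarely used columns, keyed by the same pk.
CREATE TABLE wide_table_wide (
  pk                BIGINT NOT NULL PRIMARY KEY,
  rarely_used_col_1 VARCHAR(255),
  rarely_used_col_2 VARCHAR(255)
);

-- Join only when a query actually needs a cold column.
SELECT c.pk, c.event_date, c.metric_1, w.rarely_used_col_1
FROM wide_table_core AS c
JOIN wide_table_wide AS w ON w.pk = c.pk
WHERE c.event_date >= '2025-01-01';
```

Queries that touch only hot columns never open the cold table at all, which is where the I/O saving comes from.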

Materialized views and pre-aggregation

For dashboards and repetitive analytics queries, precompute results into materialized views or summary tables rather than scanning the full wide table repeatedly.
This is a standard columnar optimization: pre-aggregate hot query patterns and read small views instead of the entire wide table. See general columnar guidance: https://motherduck.com/learn-more/columnar-storage-guide/ and Redshift notes on columnar patterns: https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html.
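
As a sketch of the summary-table variant, assuming a hypothetical table name `daily_metric_summary` and the example columns used earlier. The table is a plain snapshot, so it has to be rebuilt or incrementally refreshed by a scheduled job of your own.

```sql
-- Precompute one row per day instead of rescanning the wide table.
CREATE TABLE daily_metric_summary AS
SELECT event_date,
       COUNT(*)      AS row_count,
       SUM(metric_1) AS metric_1_sum,
       AVG(metric_2) AS metric_2_avg
FROM wide_table
GROUP BY event_date;

-- Dashboards read the small summary table.
SELECT event_date, row_count, metric_1_sum
FROM daily_metric_summary
WHERE event_date >= '2025-01-01'
ORDER BY event_date;
```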

Predicate pushdown & runtime filters

Make predicates sargable and push them as early as possible (WHERE clauses, pushed-down filters in joined subqueries). Runtime filters (Bloom, min/max) can prune many column-store blocks, but they only help if filters are evaluated before heavy column reads.
Runtime filters are described in PolarDB materials and community posts (Bloom and minmax filters): https://www.alibabacloud.com/blog/600274 and https://www.alibabacloud.com/blog/about-database-kernel-|-how-does-polardb-htap-implement-imci-query-optimization_600271.
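
To illustrate what "sargable" means here, the two queries below express the same condition; the column names are the illustrative ones from the earlier example. Wrapping the column in a function typically prevents range-based pruning, while a plain range on the column can be compared against per-block min/max values.

```sql
-- Non-sargable: the function call hides the date range from the pruner.
SELECT id, metric_1
FROM wide_table
WHERE YEAR(event_date) = 2025;

-- Sargable: the same condition written as a plain range on the column.
SELECT id, metric_1
FROM wide_table
WHERE event_date >= '2025-01-01'
  AND event_date <  '2026-01-01';
```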

Two‑phase fetch / covering-index pattern

Two-phase pattern: use columnstore to find matching primary keys (small set), then fetch the full rows from row-store (or via RID lookup) only for that small set. This reduces read amplification when SELECT * is semantically needed but the match set is small.
The IMCI/hybrid path supports this approach in certain cost regimes (see the IMCI paper): https://arxiv.org/pdf/2305.08468.
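
The two-phase idea can also be written by hand. This is a sketch under the assumption that `id` is the primary key and that `analytic_column` and `timestamp_col` exist as in the hint example in the next section; whether the inner query runs on the columnstore and the outer lookup on the row store depends on the optimizer, so check the plan with EXPLAIN.

```sql
SELECT w.*
FROM (
    -- Phase 1: narrow scan that returns only matching keys.
    SELECT id
    FROM wide_table
    WHERE analytic_column > 1000
    ORDER BY timestamp_col
    LIMIT 1000
) AS k
-- Phase 2: fetch full rows for the small key set only.
JOIN wide_table AS w ON w.id = k.id
ORDER BY w.timestamp_col;
```

The outer ORDER BY is repeated because a join does not guarantee that the derived table's ordering is preserved.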

Hybrid execution and optimizer-driven plans

IMCI can pick column-based or row-based (hybrid) execution depending on cost. You can also influence this choice with optimizer switches and hints (details below). One hint pattern that circulates in community threads:

```sql
SELECT /*+ SET_VAR(imci_optimizer_switch='force_hybrid_index_search=ON') */
 *
FROM wide_table
WHERE analytic_column > 1000
ORDER BY timestamp_col
LIMIT 1000;
```

(Example pattern referenced in community threads; test on your workload before broad use.)

Scale-out and read-only column-store nodes (MPP)

For heavy analytical workloads that legitimately scan many columns, use multi-node MPP and read-only column-store nodes to absorb load and parallelize I/O. See PolarDB MPP guidance: https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/use-multi-machine-mpp-to-speed-up-mass-data-analysis.

Query-level tactics: sampling, LIMIT, ORDER BY caution

If SELECT * is used for sampling or ad-hoc inspection, apply LIMIT and avoid expensive ORDER BY clauses that force full materialization. For cheap samples, use SELECT … LIMIT n with no ordering; TABLESAMPLE is not part of standard MySQL syntax, so use it only if your PolarDB version supports it.
For pagination, avoid queries that require full table materialization; paginate on indexed columns.
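
Paginating on an indexed column is usually done with keyset (seek) pagination rather than OFFSET. A minimal sketch, assuming `id` is the indexed primary key and that the application remembers the last `id` it received (the value 100 below is a placeholder):

```sql
-- First page.
SELECT id, event_date, metric_1
FROM wide_table
ORDER BY id
LIMIT 100;

-- Next page: seek past the last id of the previous page
-- instead of using OFFSET, which rescans the skipped rows.
SELECT id, event_date, metric_1
FROM wide_table
WHERE id > 100
ORDER BY id
LIMIT 100;
```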

IMCI Optimizer Settings and Hints

PolarDB exposes IMCI-specific optimizer flags that can help the engine choose a lower-I/O plan for wide tables. Use them carefully and test impact.

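Example switch settings, per the docs, that enable IMCI cardinality estimation and join reordering:

```sql
-- The loose_ prefix belongs to option files and parameter lists
-- (loose_imci_optimizer_switch); a SET statement takes the bare variable name.
SET GLOBAL imci_optimizer_switch =
 'use_imci_card_est=ON,use_imci_join_reorder=ON';
```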

Per-query hint (example of forcing hybrid index search):

```sql
SELECT /*+ SET_VAR(imci_optimizer_switch='force_hybrid_index_search=ON') */ ...
```

Why these matter: enabling join reorder and accurate IMCI cardinality estimation helps the cost-based optimizer build plans that apply filters early and avoid reading unnecessary column blocks. The PolarDB IMCI docs explain these options and their impact: https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/imci-query-optimization.

Diagnostics:

Use the IMCI pruner metrics and INFORMATION_SCHEMA.IMCI_COLUMNS to see how many blocks are skipped and which columns are involved; the IMCI pruner docs show how pruning reduces scanned blocks: https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/imci-optimization-for-statistic-queries.
Use EXPLAIN to inspect whether the optimizer chose a columnstore scan, hybrid strategy, or row-scan, and measure bytes read and wall time in tests.
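
A simple way to apply the EXPLAIN step is to compare the plan for the SELECT * form with the plan for the pruned form of the same query. The queries below reuse the earlier illustrative example; the exact plan output depends on your PolarDB version and is not shown here.

```sql
-- Plan for the wide read.
EXPLAIN
SELECT * FROM wide_table WHERE event_date >= '2025-01-01' LIMIT 1000;

-- Plan for the pruned read; compare scan type and columns touched.
EXPLAIN
SELECT id, event_date, metric_1, metric_2
FROM wide_table
WHERE event_date >= '2025-01-01'
LIMIT 1000;
```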

Caveat: optimizer flags can improve one query and regress another. Toggle per-session or per-query and keep regression tests.

When SELECT * Is Unavoidable: Practical Mitigations

Sometimes you must SELECT * (debugging, full ETL exports, third-party tools). Use mitigations to limit impact:

Run such queries on a read-only row-store or hybrid node so you don’t trigger full column-store read amplification on the analytic nodes (PolarDB recommends using a read-only row-store/hybrid node for debugging SELECT *): https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/overview-29.
Create a temporary snapshot: run CREATE TABLE snapshot AS SELECT * … LIMIT n (or export a small dataset) and query the snapshot.
Schedule heavy SELECT * work during off-peak windows or on dedicated read replicas.
If SELECT * is used by an internal tool, add a small caching layer or cache the results in a narrow table after the first read.
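
The snapshot mitigation might look like the sketch below. The table name `wide_table_snapshot`, the date filter, and the row limits are all placeholders chosen for illustration.

```sql
-- Pay the wide read once, on a bounded slice of the data.
CREATE TABLE wide_table_snapshot AS
SELECT *
FROM wide_table
WHERE event_date >= '2025-01-01'
LIMIT 10000;

-- Debugging and inspection now run against the small copy.
SELECT * FROM wide_table_snapshot LIMIT 50;

-- Clean up when finished.
DROP TABLE wide_table_snapshot;
```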

Monitoring, Testing and Best Practices

Measure before and after every change. Practical checklist:

Find high-impact SELECT * queries via slow query logs, audit logs or monitoring dashboards.
Replace the top offenders with explicit projection first; the biggest wins usually come from a small share of the most expensive queries.
Use EXPLAIN to confirm whether plans use columnstore pruning, hybrid execution, or full scans.
Monitor I/O (bytes read), CPU, and wall-time, plus IMCI pruner stats after changes.
Test changes against production-like data volumes — pruning and cardinality effects often only show up at scale.
Keep optimizer switch and hint changes scoped (per-session or per-query) until validated.
Document schema changes (vertical partitioning), and update application code/ORMs so new tables and column lists are used consistently.
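
For the first checklist item, one possible starting point is the statement digest summary that MySQL exposes in performance_schema. This sketch assumes performance_schema is enabled and readable on your PolarDB cluster, which may not be the case; if it is not, use the slow query log or audit log from the console instead.

```sql
-- Rank normalized SELECT * statements by total execution time.
-- sum_timer_wait is in picoseconds, hence the division by 1e12.
SELECT digest_text,
       count_star                      AS executions,
       ROUND(sum_timer_wait / 1e12, 2) AS total_seconds,
       sum_rows_sent
FROM performance_schema.events_statements_summary_by_digest
WHERE digest_text LIKE 'SELECT * FROM%'
ORDER BY sum_timer_wait DESC
LIMIT 20;
```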

A simple phased plan:

1. Audit queries and rank by cost.
2. Replace SELECT * in high-volume queries with explicit column lists.
3. Add covering summaries/materialized views for repeated patterns.
4. If still problematic, evaluate vertical partitioning or force hybrid execution for the specific queries.
5. If scans are legitimate analytics, scale read nodes (MPP) and schedule heavy workloads on dedicated nodes.

Conclusion

To mitigate massive read amplification from SELECT * on wide tables in PolarDB IMCI, start with column pruning: explicitly select only the columns you need. Then move to schema-level fixes (vertical partitioning, summary tables), IMCI optimizations (runtime filters, hybrid execution, loose_imci_optimizer_switch), and scaling (read-only column-store nodes / MPP) as required. Measure each change with EXPLAIN, IMCI pruner stats, and I/O metrics to confirm the real wins: less data read, lower latency, and more predictable analytic performance.

Authors

NeuroAnswers
