PolarDB IMCI: Optimize SELECT * on Wide Tables
How can SELECT * queries on wide tables (500+ columns) in PolarDB IMCI be optimized to mitigate massive read amplification, given that columnstore filtering is fast but fetching all columns is inefficient?
PolarDB IMCI’s columnstore filtering is fast, but SELECT * on a wide table (500+ columns) causes massive read amplification because the engine must read the blocks of every column. The quickest, highest-impact fix is column pruning: explicitly list the columns you need. After that, reduce I/O and latency further with IMCI-specific options (runtime filters, hybrid execution hints, and optimizer switches), schema changes (vertical partitioning or materialized summaries), and scaling (read-only column-store nodes / MPP).
Contents
- PolarDB IMCI: Why SELECT * on Wide Tables Causes Read Amplification
- How to Optimize SELECT * Queries in PolarDB IMCI
- Column pruning (explicit projection)
- Vertical partitioning / narrow tables
- Materialized views and pre-aggregation
- Predicate pushdown & runtime filters
- Two‑phase fetch / covering-index pattern
- Hybrid execution and optimizer-driven plans
- Scale-out and read-only column-store nodes (MPP)
- Query-level tactics: sampling, LIMIT, ORDER BY caution
- IMCI Optimizer Settings and Hints
- When SELECT * Is Unavoidable: Practical Mitigations
- Monitoring, Testing and Best Practices
- Sources
- Conclusion
PolarDB IMCI: Why SELECT * on Wide Tables Causes Read Amplification
IMCI stores data by column, so a query that reads just a few columns touches only a small fraction of the data on disk and in memory. SELECT * forces the engine to read every column’s compressed blocks and often to materialize full rows. That is the read amplification seen on tables with 500+ columns: column pruning is what makes the columnstore fast, and SELECT * defeats it, multiplying I/O.
PolarDB’s IMCI architecture tries to mitigate this (hybrid execution, RID locators, vectorized scanning), and the optimizer can sometimes choose a more efficient path. Still, the vendor docs call out the plain fix: explicitly select only the columns you need rather than using SELECT * (see PolarDB guidance on IMCIs and column pruning). For deeper architecture detail see the PolarDB-IMCI paper for how hybrid execution and index strategies behave under wide-table workloads: https://arxiv.org/pdf/2305.08468 and the PolarDB docs: https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/overview-29.
How to Optimize SELECT * Queries in PolarDB IMCI
Below are practical techniques, ordered roughly by impact and ease. Start at the top (column pruning) and only move to schema or cluster-level changes if queries still read too much data.
Column pruning (explicit projection)
- Replace SELECT * with an explicit column list wherever production traffic or frequent analytics are involved. This lets the IMCI columnstore read only the required column blocks and dramatically reduces I/O.
- Example:
-- Bad: triggers full-column reads on a 500+ column table
SELECT * FROM wide_table WHERE event_date >= '2025-01-01' LIMIT 1000;
-- Good: reads far fewer column blocks
SELECT id, event_date, metric_1, metric_2
FROM wide_table
WHERE event_date >= '2025-01-01'
LIMIT 1000;
- Apply this at the application/ORM level (projection pushdown) so queries generated by your stack never default to SELECT *.
- PolarDB docs explicitly recommend listing only needed columns to avoid read amplification: https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/overview-29.
Vertical partitioning / narrow tables
- Split the wide table into a small “hot” table (core columns used by most queries) and one or more “wide” tables holding infrequently used columns. Join only when needed.
- Pros: huge reduction in per-query I/O for common access patterns. Cons: added joins, schema complexity, and more work on updates.
- Typical pattern: wide_table_core(pk, cols_used_often), wide_table_wide(pk, rarely_used_col_1, …).
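The split described above can be sketched as follows, assuming MySQL-compatible DDL. All table names, column names, and types here are hypothetical, and the step that adds a columnstore index to each table is deliberately left out because it depends on your cluster setup.

```sql
-- Hypothetical schema: names and types are illustrative only.
CREATE TABLE wide_table_core (
  id         BIGINT NOT NULL PRIMARY KEY,
  event_date DATE   NOT NULL,
  metric_1   DOUBLE,
  metric_2   DOUBLE
);

CREATE TABLE wide_table_wide (
  id                BIGINT NOT NULL PRIMARY KEY,  -- same key as wide_table_core
  rarely_used_col_1 VARCHAR(255),
  rarely_used_col_2 VARCHAR(255)
);

-- Common path: touches only the narrow table.
SELECT id, event_date, metric_1
FROM wide_table_core
WHERE event_date >= '2025-01-01';

-- Rare path: join in the wide columns only when they are needed.
SELECT c.id, c.event_date, w.rarely_used_col_1
FROM wide_table_core AS c
JOIN wide_table_wide AS w ON w.id = c.id
WHERE c.event_date >= '2025-01-01';
```

Sharing the primary key between the two tables keeps the join a simple key lookup, at the cost of writing to two tables on insert.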
Materialized views and pre-aggregation
- For dashboards and repetitive analytics queries, precompute results into materialized views or summary tables rather than scanning the full wide table repeatedly.
- This is a standard columnar optimization: pre-aggregate hot query patterns and read small views instead of the entire wide table. See general columnar guidance: https://motherduck.com/learn-more/columnar-storage-guide/ and Redshift notes on columnar patterns: https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html.
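Stock MySQL has no CREATE MATERIALIZED VIEW statement, so one way to approximate the idea is a summary table that is refreshed on a schedule. The sketch below assumes a hypothetical wide_table_daily_summary table and reuses the illustrative event_date, metric_1, and metric_2 columns; the refresh window and the scheduling mechanism are left to you.

```sql
-- Hypothetical summary table: one row per day.
CREATE TABLE wide_table_daily_summary (
  event_date   DATE   NOT NULL PRIMARY KEY,
  row_count    BIGINT NOT NULL,
  metric_1_sum DOUBLE,
  metric_2_avg DOUBLE
);

-- Refresh: REPLACE overwrites any existing row with the same primary key.
-- Only three columns of the wide table are read here.
REPLACE INTO wide_table_daily_summary
  (event_date, row_count, metric_1_sum, metric_2_avg)
SELECT event_date, COUNT(*), SUM(metric_1), AVG(metric_2)
FROM wide_table
WHERE event_date >= '2025-01-01'
GROUP BY event_date;

-- Dashboards read the small summary instead of the wide table.
SELECT event_date, row_count, metric_1_sum
FROM wide_table_daily_summary
WHERE event_date >= '2025-01-01'
ORDER BY event_date;
```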
Predicate pushdown & runtime filters
- Make predicates sargable and push them as early as possible (WHERE clauses, pushed-down filters in joined subqueries). Runtime filters (Bloom, min/max) can prune many column-store blocks, but they only help if filters are evaluated before heavy column reads.
- Runtime filters are described in PolarDB materials and community posts (Bloom and minmax filters): https://www.alibabacloud.com/blog/600274 and https://www.alibabacloud.com/blog/about-database-kernel-|-how-does-polardb-htap-implement-imci-query-optimization_600271.
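As an illustration of a sargable predicate, the two queries below ask for the same month of data. The column names are the hypothetical ones used earlier, and how much pruning you actually get depends on the engine and the data layout, so treat this as a pattern to test rather than a guarantee.

```sql
-- Harder to prune: wrapping the column in a function hides its value range,
-- so block-level min/max metadata typically cannot be compared directly.
SELECT id, metric_1
FROM wide_table
WHERE DATE_FORMAT(event_date, '%Y-%m') = '2025-01';

-- Easier to prune: a plain range on the bare column.
SELECT id, metric_1
FROM wide_table
WHERE event_date >= '2025-01-01'
  AND event_date <  '2025-02-01';
```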
Two‑phase fetch / covering-index pattern
- Two-phase pattern: use columnstore to find matching primary keys (small set), then fetch the full rows from row-store (or via RID lookup) only for that small set. This reduces read amplification when SELECT * is semantically needed but the match set is small.
- The IMCI/hybrid path supports this approach in certain cost regimes (see the IMCI paper): https://arxiv.org/pdf/2305.08468.
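The two-phase idea can also be written out by hand. This is a sketch under assumptions: id is taken to be the primary key, analytic_column is the hypothetical filter column from the hint example, and nothing here forces the optimizer to run the two phases on different stores, so check the plan with EXPLAIN.

```sql
-- Phase 1 (inner query): find matching keys, touching only the filter
-- column and the key.
-- Phase 2 (outer query): fetch full rows for just those keys.
SELECT w.*
FROM wide_table AS w
JOIN (
  SELECT id
  FROM wide_table
  WHERE analytic_column > 1000
  ORDER BY id
  LIMIT 1000
) AS k ON k.id = w.id;
```

The derived table is used instead of IN (subquery) because MySQL does not allow LIMIT inside an IN subquery.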
Hybrid execution and optimizer-driven plans
- IMCI can pick column-based or row-based (hybrid) execution depending on cost. You can also influence behavior with optimizer switches and hints (details below). Example community hint that’s commonly used:
SELECT /*+ SET_VAR(imci_optimizer_switch='force_hybrid_index_search=ON') */
*
FROM wide_table
WHERE analytic_column > 1000
ORDER BY timestamp_col
LIMIT 1000;
(Example pattern referenced in community threads; test on your workload before broad use.)
Scale-out and read-only column-store nodes (MPP)
- For heavy analytical workloads that legitimately scan many columns, use multi-node MPP and read-only column-store nodes to absorb load and parallelize I/O. See PolarDB MPP guidance: https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/use-multi-machine-mpp-to-speed-up-mass-data-analysis.
Query-level tactics: sampling, LIMIT, ORDER BY caution
- If SELECT * is used for sampling or ad-hoc inspection, apply LIMIT and avoid expensive ORDER BYs that force full materialization. Stock MySQL syntax has no TABLESAMPLE clause, so unless your PolarDB version documents one, use SELECT … LIMIT n with no ordering for cheap samples.
- For pagination, avoid queries that require full table materialization (for example, large OFFSET values); paginate on indexed columns instead.
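Paginating on an indexed column is often done with keyset pagination, sketched here. It assumes id is the primary key, and the literal 12345 is a placeholder for the last id returned by the previous page.

```sql
-- First page.
SELECT id, event_date, metric_1
FROM wide_table
ORDER BY id
LIMIT 100;

-- Next page: 12345 stands in for the last id seen on the previous page.
SELECT id, event_date, metric_1
FROM wide_table
WHERE id > 12345
ORDER BY id
LIMIT 100;
```

Each page starts from a key range instead of skipping rows with OFFSET, so the cost per page stays roughly constant as you go deeper.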
IMCI Optimizer Settings and Hints
PolarDB exposes IMCI-specific optimizer flags that can help the engine choose a lower-I/O plan for wide tables. Use them carefully and test impact.
- Example global setting from the docs that enables IMCI cardinality estimation and join reordering:
SET GLOBAL loose_imci_optimizer_switch =
'use_imci_card_est=ON,use_imci_join_reorder=ON';
- Per-query hint (example of forcing hybrid index search):
SELECT /*+ SET_VAR(imci_optimizer_switch='force_hybrid_index_search=ON') */ ...
- Why these matter: enabling join reorder and accurate IMCI cardinality estimation helps the cost-based optimizer build plans that apply filters early and avoid reading unnecessary column blocks. The PolarDB IMCI docs explain these options and their impact: https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/imci-query-optimization.
Diagnostics:
- Use the IMCI pruner metrics and INFORMATION_SCHEMA.IMCI_COLUMNS to see how many blocks are skipped and which columns are involved; the IMCI pruner docs show how pruning reduces scanned blocks: https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/imci-optimization-for-statistic-queries.
- Use EXPLAIN to inspect whether the optimizer chose a columnstore scan, hybrid strategy, or row-scan, and measure bytes read and wall time in tests.
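A simple way to apply the EXPLAIN check is to compare the plan for the SELECT * form with the plan for the pruned form of the same query. The column names below are the illustrative ones from the earlier example, and the exact plan output format depends on your PolarDB version, so no expected output is shown.

```sql
-- Plan for the unpruned query.
EXPLAIN
SELECT *
FROM wide_table
WHERE event_date >= '2025-01-01'
LIMIT 1000;

-- Plan for the pruned query: compare scan type and the columns it reads.
EXPLAIN
SELECT id, event_date, metric_1, metric_2
FROM wide_table
WHERE event_date >= '2025-01-01'
LIMIT 1000;
```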
Caveat: optimizer flags can improve one query and regress another. Toggle per-session or per-query and keep regression tests.
When SELECT * Is Unavoidable: Practical Mitigations
Sometimes you must SELECT * (debugging, full ETL exports, third-party tools). Use mitigations to limit impact:
- Run such queries on a read-only row-store or hybrid node so you don’t trigger full column-store read amplification on the analytic nodes (PolarDB recommends using a read-only row-store/hybrid node for debugging SELECT *): https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/overview-29.
- Create a temporary snapshot: run CREATE TABLE snapshot AS SELECT * … LIMIT n (or export a small dataset) and query the snapshot.
- Schedule heavy SELECT * work during off-peak windows or on dedicated read replicas.
- If SELECT * is used by an internal tool, add a small caching layer or cache the results in a narrow table after the first read.
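The snapshot approach might look like the sketch below. The table name snapshot_wide_sample and the filter are hypothetical. CREATE TABLE … LIKE followed by INSERT … SELECT is used here as a variant of CREATE TABLE … AS SELECT because it keeps the source table's column definitions and keys.

```sql
-- Empty copy of the table definition (hypothetical snapshot name).
CREATE TABLE snapshot_wide_sample LIKE wide_table;

-- Pay the full-width read once, for a bounded number of rows.
-- Without ORDER BY, which 1000 matching rows are copied is unspecified.
INSERT INTO snapshot_wide_sample
SELECT *
FROM wide_table
WHERE event_date >= '2025-01-01'
LIMIT 1000;

-- Ad-hoc inspection now runs against the small snapshot.
SELECT * FROM snapshot_wide_sample LIMIT 20;

-- Clean up when finished.
DROP TABLE snapshot_wide_sample;
```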
Monitoring, Testing and Best Practices
Measure before and after every change. Practical checklist:
- Find high-impact SELECT * queries via slow query logs, audit logs or monitoring dashboards.
- Replace the top offenders with explicit projection first; the biggest wins usually come from a small share of the most expensive queries.
- Use EXPLAIN to confirm whether plans use columnstore pruning, hybrid execution, or full scans.
- Monitor I/O (bytes read), CPU, and wall-time, plus IMCI pruner stats after changes.
- Test changes against production-like data volumes — pruning and cardinality effects often only show up at scale.
- Keep optimizer switch and hint changes scoped (per-session or per-query) until validated.
- Document schema changes (vertical partitioning), and update application code/ORMs so new tables and column lists are used consistently.
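For the first checklist item, one possible starting point is MySQL's statement digest summary. This sketch assumes performance_schema is enabled and readable on your cluster, which may not be the case, and it is not confirmed here that statements routed to column-store nodes appear in it; slow query logs or audit logs are the fallback.

```sql
-- Rank normalized SELECT * statements by total execution time.
-- SUM_TIMER_WAIT is reported in picoseconds.
SELECT DIGEST_TEXT,
       COUNT_STAR            AS executions,
       SUM_TIMER_WAIT / 1e12 AS total_seconds,
       SUM_ROWS_EXAMINED     AS rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT LIKE 'SELECT * FROM%'
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
```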
A simple phased plan:
- Audit queries and rank by cost.
- Replace SELECT * in high-volume queries with explicit column lists.
- Add covering summaries/materialized views for repeated patterns.
- If still problematic, evaluate vertical partitioning or force hybrid execution for the specific queries.
- If scans are legitimate analytics, scale read nodes (MPP) and schedule heavy workloads on dedicated nodes.
Sources
- PolarDB IMCI query optimization
- PolarDB Documentation - IMCIs / overview (column pruning guidance)
- PolarDB IMCI optimization for statistic queries (imci_pruner)
- PolarDB MPP: use multi-node MPP
- How does PolarDB HTAP implement IMCI query optimization (Alibaba Cloud blog)
- Community example (Stack Overflow) with SET_VAR hint pattern
- PolarDB-IMCI architecture paper (arXiv)
- General columnar storage guide (MotherDuck)
- Columnar storage in Amazon Redshift (context and best practices)
- Columnstore indexes overview (Microsoft Learn)
- Runtime filters and related blog notes (Alibaba Cloud)
- Columnar query fundamentals (DBA Stack Exchange)
Conclusion
To mitigate massive read amplification from SELECT * on wide tables in PolarDB IMCI, start with column pruning: explicitly select only the columns you need. Then move to schema-level fixes (vertical partitioning, summary tables), IMCI optimizations (runtime filters, hybrid execution, loose_imci_optimizer_switch), and scaling (read-only column-store nodes / MPP) as required. Measure every change with EXPLAIN, IMCI pruner stats, and I/O metrics to confirm the real wins: less data read, lower latency, and more predictable analytic performance.