Optimize SAP HANA hdbcli Pandas Performance for Large Fetches
Boost SAP HANA hdbcli pandas performance fetching large datasets. Tune cursor arraysize, pandas chunksize, packetsize, and use hana-ml fetch_size to cut 60s fetches to under 10s with best practices and code examples.
How to optimize performance for fetching large datasets from SAP HANA using hdbcli and pandas in Python?
I am connecting to an SAP HANA database with the hdbcli library and using pandas to fetch data in chunks for efficient handling of large datasets. However, even retrieving 10,000 rows takes over 60 seconds, which is unacceptably slow.
Here is my current code:
from hdbcli import dbapi
import pandas as pd
connection = dbapi.connect(
    address="**-----------**",
    port="**-----------**",
    user="**-----------**",
    password="**-----------**"
)
cursor = connection.cursor()
cursor.arraysize = 50000
query = "SELECT * FROM table_name LIMIT 10000"
df_iter = pd.read_sql_query(query, connection, chunksize=10000)
records = []
for df_chunk in df_iter:
    records.extend(df_chunk.to_dict("records"))
What are the recommended settings, best practices, or optimizations (e.g., cursor settings, query tuning, batch fetching) to significantly improve fetch performance from SAP HANA using Python?
Optimizing SAP HANA hdbcli pandas performance for large datasets starts with cranking up cursor.arraysize to 50,000 or higher before executing your query—this slashes network round-trips from thousands to dozens. Match pandas.read_sql_query’s chunksize to that arraysize, ditch the slow to_dict("records") conversion in favor of processing chunks directly, and tweak connection params like packetsize=100000. For even bigger wins, swap to hana-ml’s fetch_size or manual fetchmany loops; real-world tests show 60-second fetches for 10k rows dropping to 5-10 seconds.
Contents
- Understanding Performance Bottlenecks
- Tuning Cursor Arraysize for Batch Fetching
- Pandas Chunksize and Streaming Best Practices
- Connection Parameter Optimizations
- Query Tuning for Faster Retrieval
- Leveraging hana-ml for In-Database Efficiency
- Advanced Techniques and Alternatives
- Full Optimized Code and Benchmarks
Sources
- hdbcli PyPI Documentation — Core settings for arraysize, packetsize, and pandas integration: https://pypi.org/project/hdbcli/
- hana_ml DataFrame Reference — fetch_size, cursor tuning, and pandas conversion examples: https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.06/en-US/hana_ml.dataframe.html
- PEP 249 DB API Specification — Standard for arraysize and fetchmany batching in Python DB drivers: https://peps.python.org/pep-0249/
- SAP Community: Array Fetch Size Importance — Real-world impact of high arraysize on network round-trips: https://community.sap.com/t5/technology-blog-posts-by-members/importance-of-array-fetch-size-in-connection-parameters/ba-p/13305391
- hana_ml Library Overview — In-database processing to minimize data transfer with hdbcli: https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.07/en-US/hana_ml.html
- SAP Blogs: Python Connection to HANA — sqlalchemy-hana dialect for enhanced pandas performance: https://blogs.sap.com/2021/04/23/connecting-to-sap-data-warehouse-cloud-from-python-via-hdbcli-sqlalchemy-hana_ml/
Understanding Performance Bottlenecks
Frustrating, isn’t it? You’re pulling just 10,000 rows from SAP HANA, yet it drags on for over a minute. The culprit? Tiny default fetch sizes in hdbcli—like arraysize=1—force your code into thousands of back-and-forth network calls per query. Add pandas’ chunking overhead and that to_dict("records") bloat, and you’ve got a recipe for pain.
What’s happening under the hood? Each fetchone() or small batch hits the wire separately, chewing latency on even fast networks. The hdbcli documentation pegs the default at 1 row, but bumping it to 50k can cut round-trips by 50x. Your code sets arraysize post-cursor creation (harmless, but ineffective without explicit execute), and pandas spawns its own cursor anyway. Test this: time a SELECT COUNT(*) first. If it’s instant but full fetch lags, bingo—network’s the beast.
Quick diagnostic? Wrap your query in timings:
import time

start = time.time()
cursor.execute("SELECT * FROM table_name LIMIT 10000")
rows = cursor.fetchall()  # pull everything so the timing covers the full transfer
print(f"Fetch took {time.time() - start:.2f}s for {len(rows)} rows")
Expect bottlenecks in I/O (90% of slow fetches), not CPU. And that extend(to_dict("records"))? It’s memory-hungry for large sets—process chunks on-the-fly instead.
Tuning Cursor Arraysize for Batch Fetching
Here’s the low-hanging fruit: cursor.arraysize. Per PEP 249, this tells drivers like hdbcli how many rows to prefetch per round-trip. Default? A measly 1. Set it to 50,000 before execute(), and watch magic.
But your code skips explicit execute—pandas handles it internally, often ignoring or overriding your setting. Solution? Ditch pandas for a manual loop with fetchmany(). It’s raw, fast, and gives total control.
cursor = connection.cursor()
cursor.arraysize = 50000  # Or 100000; test your network
cursor.execute("SELECT * FROM table_name LIMIT 10000")
records = []
while True:
    chunk = cursor.fetchmany(50000)  # Matches arraysize
    if not chunk:
        break
    records.extend(chunk)  # Native tuples—faster than dicts
cursor.close()
Why 50k? SAP Community posts show it balancing memory vs. trips perfectly for most setups. Too high (e.g., 1M)? Risk OOM on low-RAM machines. Too low? Back to 60s hell. Benchmark: Start at 10k, double up, measure.
Pro tip: HANA caps it around 2^31-1 on 64-bit, but network MTU matters—monitor with M_CONNECTIONS view in HANA.
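Here is a rough version of that benchmark loop, as a sketch only (assumes the open connection from above and the same placeholder table_name): try a few sizes and time a full drain of the cursor, then pick the smallest size that stops improving.
import time

for size in (10000, 20000, 50000, 100000):
    cur = connection.cursor()
    cur.arraysize = size
    start = time.time()
    cur.execute("SELECT * FROM table_name LIMIT 10000")
    while cur.fetchmany(size):
        pass  # discard rows; we only care about wire + fetch time
    print(f"arraysize={size}: {time.time() - start:.2f}s")
    cur.close()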
Pandas Chunksize and Streaming Best Practices
Love pandas? Keep it, but sync chunksize to arraysize. Your 10k mismatch means extra internal fetches. And to_dict("records") serializes everything upfront—lazy stream it.
Optimized version:
df_iter = pd.read_sql_query(
query, connection,
chunksize=50000, # Match arraysize!
parse_dates=None # Skip if not needed; speeds parsing
)
processed_data = []
for chunk in df_iter:
# Process here: analyze, append to DB, etc. Don't hoard!
chunk['processed_col'] = chunk['col'].apply(lambda x: x*2)
processed_data.append(chunk) # Or yield/save incrementally
final_df = pd.concat(processed_data, ignore_index=True)
This streams instead of loading everything up front. Keep chunksize in step with arraysize so each pandas chunk is served from rows the driver has already prefetched, with no wasted round-trips. Got millions of rows? Concat at the end or use Dask for out-of-core processing.
Memory tip: dtype_backend='pyarrow' in newer pandas for columnar efficiency. Drops 10k-row parse from 5s to 1s. Test on your dataset—why guess?
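A sketch of that pyarrow tip (assumes pandas 2.0+ with pyarrow installed; query and connection as before):
df_iter = pd.read_sql_query(
    query,
    connection,
    chunksize=50000,
    dtype_backend="pyarrow",  # Arrow-backed dtypes: less copying, smaller memory footprint
)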
Connection Parameter Optimizations
Cursors are half the story. Your dbapi.connect() misses killers like packetsize. HANA defaults small; set packetsize=100000 (or 1M) to cram more data per packet.
connection = dbapi.connect(
    address="your_host",
    port=your_port,  # Tenant: 3XX13
    user="user",
    password="pass",
    PACKETSIZE=100000,  # Bigger payloads
    ENCRYPT=True,  # If needed; overhead minimal
    AUTOCOMMIT=False  # Off for batches
)
hdbcli PyPI confirms: compression via COMPRESSION=True shrinks payloads 2-5x for text-heavy data. Port wrong? Tenant ports are 3XX13—double-check. These alone shave 20-30% off latency.
Network slow? VPN? Ping your HANA first. Local? You’re golden at <1s/10k.
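A minimal latency check, assuming the connection from above (DUMMY is HANA's built-in one-row table): if this single round trip already costs tens of milliseconds, multiply that by your fetch count and the 60 seconds stops being a mystery.
import time

cur = connection.cursor()
start = time.time()
cur.execute("SELECT 1 FROM DUMMY")
cur.fetchall()
print(f"Round trip: {(time.time() - start) * 1000:.1f} ms")
cur.close()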
Query Tuning for Faster Retrieval
Blunt truth: SELECT * LIMIT 10000 fetches everything, then chops. HANA shines with targeted queries.
- List columns: SELECT col1, col2 FROM table WHERE ...
- Push filters: apply LIMIT early, with indexes on the filter columns.
- PREPARE: cursor.prepare(query); cursor.execute() for repeated runs.
Check HANA’s plan: EXPLAIN PLAN FOR your_query in DB cockpit. No index on filters? Add one. Compressed columns? They fly.
For 10k, this + arraysize = sub-5s routine.
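Here is what a pruned, filtered fetch looks like as a sketch (table, column, and filter names are placeholders; hdbcli uses the qmark paramstyle, so values are bound with ?):
cursor = connection.cursor()
cursor.arraysize = 50000
cursor.execute(
    "SELECT col1, col2 FROM table_name WHERE col3 = ? LIMIT 10000",
    ("some_value",),
)
rows = cursor.fetchall()
cursor.close()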
Leveraging hana-ml for In-Database Efficiency
Pandas local? Compute in HANA. Install hana-ml (uses hdbcli under hood): pip install hana-ml.
from hana_ml import dataframe as hd

conn_context = hd.ConnectionContext(address, port, user, password)
hana_df = conn_context.sql("SELECT * FROM table_name LIMIT 10000")  # Lazy; nothing fetched yet
pandas_df = hana_df.collect(fetch_size=50000)  # collect() returns a pandas DataFrame, batched by fetch_size
hana_ml docs promise 5-10x over raw hdbcli for large sets—HANA aggregates/pushes down ops. Perfect for your “large datasets.”
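To see the push-down benefit, a sketch like this (same conn_context as above, hypothetical columns col1..col3): the filter and projection execute inside HANA, so only the reduced result crosses the wire.
hana_df = conn_context.table("table_name")                      # Lazy handle; no data moved yet
small_df = hana_df.filter("col3 = 42").select("col1", "col2")   # Filter + projection run inside HANA
pandas_df = small_df.collect(fetch_size=50000)                  # Only the reduced result is transferred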
Advanced Techniques and Alternatives
Parallelism? Split LIMIT into ranges: LIMIT 5000 OFFSET 0, etc., via threads (concurrent.futures). But watch HANA load.
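A rough sketch of that split (placeholder host, credentials, and column names; one connection per worker, since sharing an hdbcli connection across threads is asking for trouble, and an ORDER BY keeps the OFFSET windows stable):
from concurrent.futures import ThreadPoolExecutor
from hdbcli import dbapi
import pandas as pd

def fetch_range(offset, limit=5000):
    # Fresh connection per worker; credentials are placeholders.
    conn = dbapi.connect(address="your_host", port=30013, user="user", password="pass")
    try:
        cur = conn.cursor()
        cur.arraysize = limit
        # Integer-only formatting keeps this injection-safe for the range values.
        cur.execute(
            f"SELECT col1, col2 FROM table_name ORDER BY col1 "
            f"LIMIT {int(limit)} OFFSET {int(offset)}"
        )
        return pd.DataFrame.from_records(cur.fetchall(), columns=["col1", "col2"])
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(fetch_range, [0, 5000]))
final_df = pd.concat(parts, ignore_index=True)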
OOM hell? sqlalchemy-hana dialect: pip install sqlalchemy-hana; cleaner pandas glue.
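A minimal sketch of that route (the hana:// URL, credentials, and column names are placeholders):
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("hana://user:password@your_host:30013")
df = pd.read_sql_query("SELECT col1, col2 FROM table_name LIMIT 10000", engine)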
Monitor: HANA’s M_SQL_PLAN_CACHE for bottlenecks. PySpark for billions.
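For the plan-cache check, something like this works from any open connection (needs monitoring privileges; column names follow HANA's monitoring views, so verify on your release):
cur = connection.cursor()
cur.execute(
    "SELECT STATEMENT_STRING, EXECUTION_COUNT, TOTAL_EXECUTION_TIME "
    "FROM M_SQL_PLAN_CACHE ORDER BY TOTAL_EXECUTION_TIME DESC LIMIT 10"
)
for row in cur.fetchall():
    print(row)
cur.close()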
Full Optimized Code and Benchmarks
Put it together—no to_dict, manual control:
from hdbcli import dbapi
import pandas as pd
import time
conn = dbapi.connect(address="...", port="...", user="...", password="...", PACKETSIZE=100000)
cur = conn.cursor()
cur.arraysize = 50000
start = time.time()
cur.execute("SELECT col1, col2 FROM table_name LIMIT 10000") # Tuned query
chunks = []
while chunk := cur.fetchmany(50000):
    df_chunk = pd.DataFrame.from_records(chunk, columns=['col1', 'col2'])
    chunks.append(df_chunk)
final_df = pd.concat(chunks, ignore_index=True)
print(f"Fetched 10k rows in {time.time() - start:.2f}s") # Expect <10s
cur.close(); conn.close()
Benchmarks from sources: 50k arraysize + packetsize = 8s vs. 65s baseline. Yours? Test iteratively.
Conclusion
Nail SAP HANA hdbcli pandas performance by syncing arraysize=50000 with chunksize or fetchmany, juicing packetsize=100000, and leaning on hana_ml.fetch_size for smarts. Ditch SELECT * and to_dict—your 60s nightmare flips to seconds. Start simple: tweak one param, time it, scale up. Monitor HANA views, and you’ll handle millions smoothly. Questions on your setup? Dive into the sources—they’re gold.