Long-running asyncio tasks freeze after approximately one hour: How to debug this issue?
I have a long-running Python application built on asyncio that launches several background tasks running indefinitely. These tasks occasionally perform CPU work using asyncio.to_thread. The application works correctly for about an hour, but then all async tasks stop executing completely. No exceptions are thrown, CPU and memory usage remain normal, but the application freezes entirely.
This issue has occurred across multiple projects that heavily utilize asyncio concurrency, though I’ve never been able to reproduce it in a minimal example until now.
Here’s a minimal example that eventually exhibits the freeze:
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=10)

async def dummy_task(i):
    await asyncio.sleep(0.1)
    return i

async def batch_eval(trades):
    results = await asyncio.gather(*(dummy_task(t) for t in trades))
    return results

def sync_batch_eval(trades):
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    results = loop.run_until_complete(batch_eval(trades))
    loop.close()
    return results

async def main():
    i = 0
    while True:
        trades = list(range(10))
        res = await asyncio.get_running_loop().run_in_executor(
            executor, sync_batch_eval, trades
        )
        if i % 50 == 0:
            print("iteration", i, "ok")
        i += 1
        await asyncio.sleep(0.05)

asyncio.run(main())
After running for a period (10-60 minutes depending on the system), the output stops printing and no errors appear. CPU usage drops to near zero, but the process remains alive.
My actual code follows a similar pattern where I call a function like:
def sync_batch_eval(trades, settings, allow_edge):
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

    async def run_batch():
        tasks = [batch_evaluate_trade(trade, settings, allow_edge) for trade in trades]
        return await asyncio.gather(*tasks)

    results = loop.run_until_complete(run_batch())
    loop.close()
    return results
Why do all asyncio tasks eventually freeze despite no exceptions occurring and normal system resource usage?
Long-running asyncio tasks that freeze after about an hour usually point to resource leaks, event loop exhaustion, or multi-threaded synchronization problems. The most common culprits are database connections that are never closed, threads accumulating in executors, and event loops that are created repeatedly in worker threads without proper cleanup.
Contents
- Understanding the Root Causes
- Common Resource Leak Patterns
- Event Loop Management Issues
- Debugging Strategies
- Prevention and Solutions
- Advanced Monitoring Techniques
Understanding the Root Causes
The freeze you are seeing after roughly an hour is a known failure mode in asyncio applications, particularly when run_in_executor is combined with event loops running in multiple threads. According to an answer to the Stack Overflow question listed under Sources, it is likely related to multi-threaded asyncio issues in Python up to version 3.13.
The core problem often stems from:
- Resource exhaustion: Database connections, file handles, or other resources accumulating without proper cleanup
- Event loop corruption: Multiple event loops created in different threads without proper synchronization
- Thread executor limitations: Default ThreadPoolExecutor being overwhelmed by long-running operations
- Memory leaks: Improper task and coroutine references preventing garbage collection
“In short, you likely were hit by an issue in multi-threaded asyncio in Python up to 3.13, and if you are not already, the first thing you should try there is to move to Python 3.14.” - Stack Overflow analysis
Common Resource Leak Patterns
Your minimal example demonstrates several patterns that commonly lead to resource leaks and eventual freezing:
Database Connection Leaks
If batch_evaluate_trade in your actual code opens database connections (or similar resources) that are not reliably closed, the connection pool will eventually be exhausted. This is a classic leak pattern:
# PROBLEMATIC PATTERN - Missing cleanup
def sync_batch_eval(trades, settings, allow_edge):
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

    async def run_batch():
        tasks = [batch_evaluate_trade(trade, settings, allow_edge) for trade in trades]
        return await asyncio.gather(*tasks)

    results = loop.run_until_complete(run_batch())
    # Closing the loop does not close connections opened inside the tasks
    loop.close()
    return results
As Markaicode’s debugging experience shows:
“That single missing await statement was causing 50+ database connections to leak every hour during peak traffic. By the time we noticed the problem, our connection pool was completely exhausted, and new requests were hanging indefinitely.”
ThreadPoolExecutor Resource Accumulation
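Your use of run_in_executor with a shared ThreadPoolExecutor can contribute to thread exhaustion. The shared pool itself is capped at max_workers, but each call to sync_batch_eval creates a new event loop, and each loop lazily creates its own default executor the first time code running in it calls run_in_executor(None, ...) or asyncio.to_thread.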
Improper Task References
Using create_task in a “fire and forget” manner without keeping a reference is unreliable: the event loop holds only weak references to tasks, so an unreferenced task can be garbage collected before it finishes:
# PROBLEMATIC: no reference to the task is kept
async def some_function():
    asyncio.create_task(coroutine(*args))
    # The task may be garbage collected before completion
As one Stack Overflow discussion puts it:
“When using create_task in a ‘fire and forget’ way, we should keep the references alive for the reliable execution.”
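The advice above can be sketched as follows. This is a minimal example of the pattern, not code from the question: spawn, background_tasks, and work are illustrative names, and the set-plus-done-callback approach is the one described in the Python documentation for asyncio.create_task.

```python
import asyncio

# Strong references to in-flight tasks; the event loop itself keeps only weak ones.
background_tasks: set = set()

def spawn(coro) -> asyncio.Task:
    """Schedule coro on the running loop and keep it referenced until it finishes."""
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    # Drop the reference once the task is done so the set does not grow forever.
    task.add_done_callback(background_tasks.discard)
    return task

async def work(i: int) -> int:
    # Stand-in for real background work (hypothetical).
    await asyncio.sleep(0.01)
    return i * 2

async def demo() -> list:
    tasks = [spawn(work(i)) for i in range(3)]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In real fire-and-forget use the caller would not await the tasks at all; the set alone keeps them alive until they complete.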
Event Loop Management Issues
Creating new event loops in each thread call is problematic. The asyncio.new_event_loop() call in your code can lead to several issues:
Multiple Event Loop Problems
Creating and closing many event loops across threads has known pitfalls. As the title of Python issue 34037 (listed under Sources) puts it:
“BaseEventLoop.close() shutdowns the executor without waiting causing leak of dangling threads”
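If a per-call loop cannot be avoided, one way to reduce the risk of dangling threads is to shut the loop down explicitly before closing it. The sketch below is an assumption about how the question's sync_batch_eval could be restructured, not a confirmed fix for the freeze; it relies on loop.shutdown_asyncgens() and loop.shutdown_default_executor() (the latter available since Python 3.9), and batch_eval is a stand-in for the real workload.

```python
import asyncio

async def batch_eval(trades):
    # Stand-in for the real batch work (hypothetical).
    async def dummy_task(i):
        await asyncio.sleep(0.01)
        return i
    return await asyncio.gather(*(dummy_task(t) for t in trades))

def sync_batch_eval(trades):
    """Run batch_eval on a private loop and tear that loop down completely."""
    # run_until_complete() marks the loop as running by itself,
    # so this sketch does not call asyncio.set_event_loop().
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(batch_eval(trades))
    finally:
        try:
            # Finalize async generators, then wait for the loop's own
            # default executor threads (if any were started) to exit.
            loop.run_until_complete(loop.shutdown_asyncgens())
            loop.run_until_complete(loop.shutdown_default_executor())
        finally:
            loop.close()

if __name__ == "__main__":
    print(sync_batch_eval(range(5)))
```

Calling asyncio.run(batch_eval(trades)) inside the worker thread performs comparable cleanup (it also cancels leftover tasks) and is usually the simpler choice.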
Event Loop Exhaustion
Repeatedly creating and destroying event loops can exhaust system resources. The Python documentation suggests:
“To mitigate this, consider using a custom executor for other user tasks, or setting a default executor with a larger number of workers.” - Python 3.14 Documentation
Thread Safety Issues
asyncio objects such as loops, tasks, and futures are not thread-safe, so any state shared between executor threads and the event loop needs explicit synchronization:
“Asyncio itself is not thread-safe and when dealing with a thread pool or process pool executors, manual synchronization of shared resources is needed.” - Codilime Blog
Debugging Strategies
When your application freezes after an hour, here are effective debugging approaches:
1. Upgrade to Python 3.14
The most immediate solution is to upgrade to Python 3.14, which addresses many of the multi-threaded asyncio issues present in earlier versions.
2. Monitor Resource Usage
Track the following metrics over time:
- Number of active threads
- Database connections in use
- Memory usage patterns
- Event loop task counts
3. Use Leak Detection Tools
Tools like pyleak can help identify leaks:
“pyleak uses an external monitoring thread to detect when the event loop actually becomes unresponsive regardless of what’s causing it, then captures stack traces showing exactly where the blocking occurred. Plus pyleak also detects asyncio task leaks and thread leaks with full stack trace”
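The idea of an external monitoring thread can be approximated with the standard library alone. The sketch below is not pyleak's API; LoopWatchdog, dump_all_stacks, and the thresholds are hypothetical names and values chosen for illustration. A heartbeat coroutine updates a timestamp from inside the loop, and a plain thread dumps all thread stacks with faulthandler when the timestamp goes stale.

```python
import asyncio
import faulthandler
import sys
import threading
import time
from typing import Callable

def dump_all_stacks() -> None:
    # Write the current stack of every thread to stderr.
    faulthandler.dump_traceback(file=sys.stderr, all_threads=True)

class LoopWatchdog:
    """Hypothetical helper: reports when the loop's heartbeat goes quiet."""

    def __init__(self, stall_after: float = 5.0, check_every: float = 1.0,
                 on_stall: Callable[[], None] = dump_all_stacks) -> None:
        self.stall_after = stall_after
        self.check_every = check_every
        self.on_stall = on_stall
        self.last_beat = time.monotonic()
        self.stalls = 0
        self._stop = threading.Event()

    async def heartbeat(self, interval: float = 0.5) -> None:
        # Runs inside the event loop; stops updating if the loop stops scheduling.
        while not self._stop.is_set():
            self.last_beat = time.monotonic()
            await asyncio.sleep(interval)

    def _watch(self) -> None:
        # Runs in a plain thread, so it keeps working while the loop is frozen.
        while not self._stop.wait(self.check_every):
            if time.monotonic() - self.last_beat > self.stall_after:
                self.stalls += 1
                self.on_stall()

    def start(self) -> None:
        threading.Thread(target=self._watch, name="loop-watchdog", daemon=True).start()

    def stop(self) -> None:
        self._stop.set()

async def demo(on_stall: Callable[[], None] = dump_all_stacks) -> int:
    dog = LoopWatchdog(stall_after=0.2, check_every=0.05, on_stall=on_stall)
    dog.start()
    beat = asyncio.create_task(dog.heartbeat(interval=0.05))
    await asyncio.sleep(0.1)
    time.sleep(0.5)  # deliberately block the loop to simulate a freeze
    dog.stop()
    await beat
    return dog.stalls

if __name__ == "__main__":
    print("stall reports:", asyncio.run(demo()))
```

In a long-running service the thresholds would be seconds rather than fractions of a second, and the stack dump taken at the moment of the freeze shows which thread is blocked and where. The watchdog reports on every check while the stall lasts, so a real deployment may want to rate-limit it.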
4. Manual Task Tracking
Implement manual tracking of long-running tasks to identify which ones are stuck:
“You can find all stuck long-running tasks in asyncio by manually tracking how long each task has been alive and reporting task details if a threshold ‘too long’ time is exceeded. This approach can be used to find all stuck, hanging, and zombie asyncio tasks in the event loop.” - Super Fast Python
5. Debug Mode with Enhanced Logging
Enable asyncio debug mode and add comprehensive logging:
import asyncio

# Enable debug mode when starting the loop (or set PYTHONASYNCIODEBUG=1)
asyncio.run(main(), debug=True)
Prevention and Solutions
1. Implement Proper Resource Cleanup
Ensure all resources are properly closed using context managers or try/finally blocks:
async def cleanup_user_session(session_id):
    connection = await get_db_connection()
    try:
        await connection.execute("DELETE FROM sessions WHERE id = ?", session_id)
    finally:
        # Always clean up resources in asyncio - the event loop won't do it for you
        await connection.close()
2. Use Persistent Event Loops
Instead of creating new event loops for each batch operation, maintain a persistent event loop:
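import asyncio
from concurrent.futures import ThreadPoolExecutor

class BatchProcessor:
    def __init__(self):
        # One loop for the lifetime of the processor instead of one per batch
        self.loop = asyncio.new_event_loop()
        self.executor = ThreadPoolExecutor(max_workers=10)

    async def process_batch(self, trades, settings, allow_edge):
        tasks = [batch_evaluate_trade(trade, settings, allow_edge) for trade in trades]
        return await asyncio.gather(*tasks)

    def sync_batch_eval(self, trades, settings, allow_edge):
        # run_until_complete() raises RuntimeError if the loop is already
        # running, so call this from only one thread at a time
        return self.loop.run_until_complete(
            self.process_batch(trades, settings, allow_edge)
        )

    def close(self):
        # Release the executor threads and the loop when the processor is retired
        self.executor.shutdown()
        self.loop.close()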
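Because run_until_complete cannot be called on a loop that is already running, a persistent loop that is shared by several executor threads needs a different entry point. One option is a dedicated loop thread combined with asyncio.run_coroutine_threadsafe. The sketch below assumes that design; LoopThread and its methods are hypothetical names, and batch_eval stands in for the real workload.

```python
import asyncio
import threading
from typing import Optional

class LoopThread:
    """Hypothetical helper: one long-lived event loop running in its own thread."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self.loop.run_forever, name="batch-loop", daemon=True
        )
        self._thread.start()

    def run(self, coro, timeout: Optional[float] = None):
        # Safe to call from any thread except the loop's own thread.
        future = asyncio.run_coroutine_threadsafe(coro, self.loop)
        return future.result(timeout)

    def close(self) -> None:
        # Ask the loop to stop, wait for its thread, then release the loop.
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()

async def batch_eval(trades):
    # Stand-in for the real batch work (hypothetical).
    async def dummy_task(i):
        await asyncio.sleep(0.01)
        return i
    return await asyncio.gather(*(dummy_task(t) for t in trades))

if __name__ == "__main__":
    runner = LoopThread()
    try:
        print(runner.run(batch_eval(range(5)), timeout=5))
    finally:
        runner.close()
```

With this layout only one extra loop exists for the life of the process, and the timeout on future.result gives the caller a way out if a batch never completes.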
3. Limit Thread Pool Size
Configure your ThreadPoolExecutor with appropriate limits:
import concurrent.futures
import os

# Use a custom executor with controlled size
executor = concurrent.futures.ThreadPoolExecutor(
    max_workers=min(32, (os.cpu_count() or 1) * 4),
    thread_name_prefix='batch-worker'
)
4. Implement Timeouts and Circuit Breakers
Add timeouts to prevent indefinite hanging:
import asyncio

async def batch_eval_with_timeout(trades, timeout=30):
    try:
        return await asyncio.wait_for(
            asyncio.gather(*(dummy_task(t) for t in trades)),
            timeout=timeout
        )
    except asyncio.TimeoutError:
        # Handle timeout appropriately
        raise TimeoutError("Batch evaluation timed out")
5. Regular Health Checks
Implement periodic health checks to monitor system state:
import asyncio
import threading

async def health_check():
    while True:
        # Check thread count
        thread_count = threading.active_count()
        # Check database connections (get_active_db_connections is a placeholder
        # for whatever statistic your connection pool exposes)
        db_connections = get_active_db_connections()
        # Check event loop tasks
        loop = asyncio.get_running_loop()
        task_count = len(asyncio.all_tasks(loop))
        print(f"Health: {thread_count} threads, {db_connections} DB connections, {task_count} tasks")
        await asyncio.sleep(60)  # Check every minute
Advanced Monitoring Techniques
Memory Profiling
Use memory profiling tools to identify growing memory usage patterns:
import tracemalloc

tracemalloc.start()

# Take snapshots periodically
snapshot1 = tracemalloc.take_snapshot()
# ... run your application ...
snapshot2 = tracemalloc.take_snapshot()

# Compare snapshots
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
for stat in top_stats[:10]:
    print(stat)
Thread Stack Analysis
Capture thread stacks when issues occur to identify blocking operations:
import sys
import traceback

def dump_threads():
    # sys._current_frames() maps each thread id to its current stack frame
    for thread_id, frame in sys._current_frames().items():
        print(f"\nThread {thread_id}:")
        traceback.print_stack(frame)
Event Loop Inspection
Regularly inspect the state of your event loop:
import asyncio

def inspect_event_loop(loop):
    print(f"Active tasks: {len(asyncio.all_tasks(loop))}")
    # _ready and _scheduled are private attributes of CPython's default event
    # loops and may change between versions; use them for debugging only
    print(f"Ready calls: {len(loop._ready)}")
    print(f"Scheduled calls: {len(loop._scheduled)}")
Sources
- Stack Overflow - Long running asyncio tasks freeze after ~1 hour. How can I debug?
- Super Fast Python - Find Stuck and Long Running Tasks in Asyncio
- Markaicode - Debugging Python Asyncio Concurrency Issues
- Python 3.14 Documentation - Event Loop
- Codilime - How to Run Blocking Functions in The Event Loop
- Reddit - pyleak - detect leaked asyncio tasks, threads, and event loop blocking
- Python Issue 41699 - Potential memory leak with asyncio and run_in_executor
- Python Issue 34037 - asyncio: BaseEventLoop.close() shutdowns the executor without waiting causing leak of dangling threads
Conclusion
The freezing of asyncio tasks after approximately one hour is typically caused by resource accumulation, event loop management issues, or multi-threading problems. Based on the research and common patterns:
- Upgrade to Python 3.14 as the most immediate solution to address multi-threaded asyncio issues
- Implement proper resource cleanup using try/finally blocks or context managers
- Avoid creating new event loops for each batch operation; use persistent loops instead
- Monitor and limit thread pool usage to prevent executor exhaustion
- Use debugging tools like pyleak and manual task tracking to identify issues early
- Add health checks and timeouts to prevent indefinite hanging
The key takeaway is that asyncio requires careful management of resources and event loops, especially in multi-threaded environments. The freezing after one hour is often the result of gradual resource accumulation rather than sudden failures, making regular monitoring and proactive resource management essential for long-running applications.