Fix Snakemake RuntimeError: Can't Start New Thread
Resolve the Snakemake RuntimeError "can't start new thread" in large workflows with many studies: use --max-threads 1, raise ulimit -u to 4096, and tune --jobs/--cores to avoid job_scheduler.py thread exhaustion near 98% completion.
Snakemake RuntimeError: can’t start new thread when executing large workflow with many studies
I’m running a Snakemake workflow that performs the following steps:
- Downloads studies from a database (each study contains data for multiple individuals).
- Splits each study into one file per individual.
- Filters data for each individual.
The workflow completes successfully with a few studies, but fails when processing many studies, raising the error:
RuntimeError: can't start new thread
This occurs near the end of execution (e.g., 98% complete), during job scheduling reevaluation in job_scheduler.py. Full traceback points to threading.py failing to start a new thread.
Command used:
snakemake --cores 3 --jobs 3 --resources mem_mb=7000 download_slots=1 \
--config config_path="shiny-app/configs/config.yaml"
It appears Snakemake may be retaining open job threads after completion, exhausting the thread limit. How can I resolve this issue and successfully run the workflow on large datasets?
The Snakemake RuntimeError: can’t start new thread strikes when a large workflow (many studies to download, one file per individual to split out, filters to apply) hits the OS thread limit late in the run, often around 98% complete inside job_scheduler.py. It happens because Snakemake’s scheduler spins up threads for job monitoring and DAG reevaluation, and in a DAG with thousands of jobs those threads exhaust the per-user limit. Fix it fast by capping per-job threads with --max-threads 1, raising the limit with ulimit -u 4096, and lowering --jobs; the updated invocation becomes ulimit -u 4096 && snakemake --cores 3 --jobs 2 --max-threads 1 --resources mem_mb=7000 download_slots=1 --config ....
Contents
- Understanding the Snakemake RuntimeError: can’t start new thread
- Root Causes: Thread Limits in Snakemake Large Workflows
- Quick Fix: Using --max-threads in Snakemake CLI
- System-Level Solution: Adjusting ulimit for Snakemake
- Optimizing Snakemake Jobs, Cores, and Resources for Many Studies
- Best Practices for Large DAGs and Checkpoints in Snakemake
- Monitoring and Troubleshooting Snakemake Thread Exhaustion
- Sources
- Conclusion
Understanding the Snakemake RuntimeError: can’t start new thread
Picture this: your workflow hums along perfectly for a handful of studies—downloads zip through, individuals get split out, filters apply without a hitch. Scale up to hundreds or thousands, and bam, right around 98% done, Snakemake chokes with RuntimeError: can't start new thread deep in job_scheduler.py and Python’s threading.py.
Why late? The scheduler reevaluates the DAG after jobs finish, trying to queue more while still holding threads from earlier ones. In a closely matching case on Stack Overflow, a similar pipeline failed at job 3476 of 3532. Your command (--cores 3 --jobs 3) lets Snakemake parallelize nicely at first, but the thread pile-up eventually kills it. This is not a bug in your rules; it is resource exhaustion in big workflows.
Root Causes: Thread Limits in Snakemake Large Workflows
Large Snakemake workflows run into the OS-imposed threading ceiling; in practice a Python process often hits a wall around 900-1000 threads once per-user limits kick in. Each job needs monitor threads, and with --jobs 3 and --cores 3 plus scheduler overhead, you burn through them fast across 3500+ jobs from many studies.
Key culprits? Snakemake’s job scheduler creates threads for output checking, DAG updates, and retries, which is exacerbated in huge DAGs, as noted in Snakemake GitHub issue #2354 for 40k-job runs. The OS enforces ulimit -u (max user processes/threads), which often defaults to 1024-2048 on Linux, and every Python thread is a kernel thread that counts against it. Running in Docker? Containers tighten limits further, per the Docker forums.
And here’s the kicker: small tests mask it. Few studies mean tiny DAGs, low thread use. Ramp up, and threads linger post‑job, starving new ones.
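To see the ceiling the scheduler is fighting, you can query the per-user limit and the live thread count from Python (a sketch for Linux; the numbers on your machine will differ):
import resource
import threading

# RLIMIT_NPROC is what `ulimit -u` reports; kernel threads count against it.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"max user processes/threads: soft={soft} hard={hard}")

# Threads alive in *this* interpreter; inside snakemake the scheduler,
# job monitors, and DAG reevaluation all add to the equivalent number.
print(f"threads in this process: {threading.active_count()}")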
Quick Fix: Using --max-threads in Snakemake CLI
Don’t overthink it: --max-threads 1 is the immediate fix. It caps the number of threads any single job can request, overriding rule-level threads: directives, so far fewer threads are alive at once.
Update your command:
ulimit -u 4096 && snakemake --cores 3 --jobs 2 --max-threads 1 --resources mem_mb=7000 download_slots=1 \
--config config_path="shiny-app/configs/config.yaml"
Per the official Snakemake CLI docs, --max-threads sets a global upper bound on the threads any rule may use, regardless of what the rule declares. Users in that Stack Overflow thread confirmed it pushed their run to completion. Why --jobs 2 as well? Fewer concurrent jobs means fewer scheduler and monitor threads. Test it; your workflow should cruise past 98%.
Pro tip: Run snakemake --help | grep threads to verify.
System-Level Solution: Adjusting ulimit for Snakemake
CLI tweaks help, but if threads still cap out, crank the OS limit. ulimit -u governs user threads/processes—check yours with ulimit -a. Defaults like 1024? No wonder it fails on large datasets.
Boost it temporarily:
ulimit -u 4096 # Or 8192 for massive runs
snakemake ... # Your workflow here
For persistence, edit /etc/security/limits.conf:
youruser soft nproc 4096
youruser hard nproc 4096
Log out and back in (or reboot). The SuperUser diagnostics thread suggests monitoring with ps -fLu $USER | wc -l (threads running under your user) and cat /proc/sys/kernel/threads-max. Pair this with --max-threads 1; one report there fixed a Python app that was hitting 1000+ threads.
Containers? Docker defaults tighter—docker run --ulimit nproc=4096:4096 ....
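If you drive Snakemake from a Python script instead of a shell, the same headroom can be granted with the resource module before spawning the process (a sketch assuming Linux and that the hard limit allows 4096; the argument list mirrors the command above):
import resource
import subprocess

# Raise the soft "max user processes" limit (what `ulimit -u` shows)
# toward 4096 without exceeding the hard limit the OS permits.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NPROC, (max(soft, target), hard))

# The child process inherits the raised limit.
subprocess.run([
    "snakemake", "--cores", "3", "--jobs", "2", "--max-threads", "1",
    "--resources", "mem_mb=7000", "download_slots=1",
    "--config", "config_path=shiny-app/configs/config.yaml",
], check=True)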
Optimizing Snakemake Jobs, Cores, and Resources for Many Studies
Your command uses --cores 3 --jobs 3, but for large Snakemake workflows it pays to treat them separately: --jobs caps concurrent jobs (and thus scheduler and monitor threads), while --cores caps the total CPUs shared across those jobs.
Tune like this:
| Flag | Role | Recommendation for Large Runs |
|---|---|---|
| --jobs | Concurrent jobs | 2-4 (lower = fewer scheduler threads) |
| --cores | Total CPUs | Match your machine (e.g., 12) |
| --max-threads | Per-job threads | 1 (eliminates intra-job parallelism) |
| --resources | Custom limits | Add io=2 for downloads, scale mem_mb |
In rules, declare conservatively:
rule split_individuals:
    input: "study_{study}.zip"
    output: "individual_{individual}.txt"
    threads: 1  # threads is its own directive, not a resource; --max-threads can only lower it
    resources:
        mem_mb=1000
    shell: "..."
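The same mechanism throttles the download step: a rule can declare the custom download_slots resource you already pass on the CLI, so only one download runs at a time (a sketch; the rule name, output path, and mem_mb value are illustrative):
rule download_study:
    output: "study_{study}.zip"
    resources:
        mem_mb=2000,       # assumed value; scale to your data
        download_slots=1   # counted against --resources download_slots=1
    shell: "..."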
Snakemake GitHub issue #2532 highlights bugs where rule-level threads: settings interact badly with high --jobs values. The advanced tutorial explains how --cores distributes threads: with --cores 12 and --max-threads 4, each job gets at most 4 cores, and rules requesting more are scaled down. Dial back parallelism for stability.
Best Practices for Large DAGs and Checkpoints in Snakemake
Massive study counts explode your DAG. Break it: use checkpoints to split downloads/splits/filters into phases.
Example Snakefile snippet:
import os

checkpoint download_studies:
    input: config["studies"]
    output: directory("studies/")
    ...

def aggregate_individuals(wildcards):
    # Only evaluated after the checkpoint has run, so the DAG grows incrementally
    checkpoint_output = checkpoints.download_studies.get(**wildcards).output[0]
    ids = glob_wildcards(os.path.join(checkpoint_output, "{id}.zip")).id
    return expand("individual_{id}.txt", id=ids)

rule filter_individual:
    input: "individual_{id}.txt"
    output: "filtered_{id}.txt"
    ...
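To actually trigger filtering for every individual, a target rule can take an aggregation function as its input once the checkpoint resolves (a minimal sketch; the rule name and the filtered_* pattern follow the example above):
def aggregate_filtered(wildcards):
    # Same pattern as above, but requesting the *filtered* outputs
    checkpoint_output = checkpoints.download_studies.get(**wildcards).output[0]
    ids = glob_wildcards(os.path.join(checkpoint_output, "{id}.zip")).id
    return expand("filtered_{id}.txt", id=ids)

rule all_filtered:
    input: aggregate_filtered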
This rebuilds the DAG incrementally, so fewer jobs (and threads) are in flight at once. As the advanced docs note, checkpoints shine for dynamic outputs. Also consider --latency-wait 60 to tolerate filesystem lag after the split step.
Why bother? Smaller DAGs mean less scheduler strain, dodging thread errors entirely.
Monitoring and Troubleshooting Snakemake Thread Exhaustion
Spot issues early. Before runs:
- ulimit -a | grep process: check your per-user limit ("max user processes").
- cat /proc/sys/kernel/threads-max: the system-wide thread maximum.
During: htop or ps -eLf | grep snakemake | wc -l for thread counts. If spiking near ulimit, throttle more.
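If you prefer scripting that check, a small Python watcher can poll the thread count of the running snakemake process via /proc (a sketch assuming Linux; pass the PID as the first argument, and treat the warning level as a rough heuristic near the ~1000-thread wall):
import sys
import time

def thread_count(pid: int) -> int:
    # /proc/<pid>/status exposes a "Threads:" line on Linux
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return 0

if __name__ == "__main__":
    pid = int(sys.argv[1])
    while True:
        n = thread_count(pid)
        print(f"snakemake pid {pid}: {n} threads")
        if n > 900:  # rough warning level; tune to your ulimit
            print("WARNING: approaching thread limits; consider lowering --jobs")
        time.sleep(30)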
Logs? Add --verbose for scheduler and DAG insights. Still failing? Discussions of Python thread limits peg roughly 1000 threads as a common wall. Docker? Raise the container's nproc ulimit or upgrade the base image, as suggested in the forums.
Incremental tests: 10 studies, 100, then all. You’ll nail it.
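One way to script those incremental tests is a hypothetical max_studies config knob that truncates the study list before the DAG is built (a sketch; max_studies and the STUDIES variable are illustrative assumptions, and config["studies"] is assumed to be a list of study identifiers):
# In the Snakefile, before any rules that expand over studies:
STUDIES = sorted(config["studies"])
if "max_studies" in config:
    STUDIES = STUDIES[: int(config["max_studies"])]  # e.g. 10, then 100, then all

# Downstream expand() calls use STUDIES, so the DAG scales with the test:
#   snakemake --cores 3 --jobs 2 --max-threads 1 --config max_studies=10 ...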
Sources
- Snakemake RuntimeError can’t start new thread — Exact traceback match for large workflow failure at 98%: https://stackoverflow.com/questions/79867047/snakemake-runtimeerror-cant-start-new-thread
- Error: can’t start new thread — Explains Python process thread limits and ulimit checks: https://stackoverflow.com/questions/1834919/error-cant-start-new-thread
- Snakemake CLI Arguments — Official docs on --max-threads, --jobs, --cores usage: https://snakemake.readthedocs.io/en/stable/executing/cli.html
- RuntimeError: can’t start new thread — Linux diagnostics with ulimit -a and thread counting: https://superuser.com/questions/1682148/runtimeerror-cant-start-new-thread
- Extreme scheduler strain in large workflows — GitHub discussion on 40k-job DAG overhead: https://github.com/snakemake/snakemake/issues/2354
- Snakemake Tutorial: Advanced Features — Guidance on threads, resources, and checkpoints: https://snakemake.readthedocs.io/en/stable/tutorial/advanced.html
- RuntimeError: can’t start new thread (Docker) — Container-specific threading fixes: https://forums.docker.com/t/runtimeerror-cant-start-new-thread/138142
Conclusion
Tame the Snakemake RuntimeError: can’t start new thread in large workflows by capping per-job threads with --max-threads 1, raising ulimit -u to 4096, and trimming --jobs to 2 while keeping --cores high enough for throughput. Add checkpoints to contain DAG bloat, monitor thread counts as you go, and test incrementally, and you will process those thousands of studies smoothly. These tweaks turn a failing run into a reliable pipeline; give the updated command a spin and watch it finish.