Troubleshoot Dask: Client Connects but No Workers Available
Dask client connects but reports no workers - troubleshoot the scheduler, worker registration, network/TLS, version mismatches, autoscaling, and worker logs.
Dask client connects successfully but shows no workers available
I am using Dask for file processing. The client starts successfully and connects to tls://localhost:xxxx, but zero workers are available, so tasks like client.map() and client.gather() do not run.
Client creation:
client = Client("tls://localhost:xxxx")
Processing code:
start = time.perf_counter()
print(f"Submitting {len(root_files)} files to the cluster..")
futures = client.map(process_file, root_files)
results = client.gather(futures)
all_masses = np.concatenate(results)
elapsed = time.perf_counter() - start
print(f"Total events processed: {len(all_masses)}")
print(f"Processing time: {elapsed:.2f} s")
What could be causing no workers to register or become available in the Dask cluster? How can I troubleshoot and fix this issue?
Most commonly this happens because workers never register with the scheduler — either they haven’t started yet, can’t reach the scheduler (network/TLS/addressing), are started but immediately die (version or environment mismatch, exceptions), or your cluster manager has scaled to zero. Start by checking scheduler reachability, verifying worker processes and logs, and using client.wait_for_workers(1) or client.scheduler_info() to see whether any workers are actually registered.
Contents
- Quick checklist: Dask no workers
- Confirm scheduler and worker processes
- Network, address binding and TLS issues that cause Dask no workers
- Process model, imports and pickling problems
- Version and environment mismatches that make workers die
- Cluster managers and autoscaling pitfalls
- Logs, commands and practical troubleshooting steps
- Common error messages and next steps
- Conclusion
- Sources
Quick checklist: Dask no workers
- Does client.scheduler_info()['workers'] show any entries?
- Is the scheduler process running and reachable (port and dashboard)?
- Are worker processes running on the expected hosts/nodes? Inspect their logs.
- Are TLS/protocol settings consistent (e.g. client uses tls:// but workers were started without TLS)?
- Are you using a cluster manager (SLURM, Kubernetes, Dask-Gateway) that may be queuing/limiting workers or scaling to zero?
If you haven’t checked any of the above, stop and do those first — you’ll resolve most “no workers” cases quickly.
Confirm scheduler and worker processes
Verify the scheduler is running and listening on the address you think it is:
- Check the scheduler process: ps aux | grep dask-scheduler (or systemd/journalctl if it was started as a service).
- Confirm the dashboard is reachable (default web UI on port 8787): open http://<scheduler-host>:8787 or inspect client.dashboard_link.
- From your Python client, run quick introspection:
from distributed import Client
client = Client("tls://localhost:xxxx")
print("dashboard:", client.dashboard_link)
print("scheduler_info:", client.scheduler_info()) # dict with 'workers'
print("ncores:", client.ncores()) # mapping of worker → cores
If client.scheduler_info()['workers'] is {} then the scheduler does not see any registered workers. The Dask scheduler state docs explain how scheduler state reflects registration/queued tasks.
If the workers should be running on the same host, run ps aux | grep dask-worker or check the worker CLI terminal. If workers are on remote hosts, ssh into a node and inspect dask-worker processes and their logs.
Useful quick test: start a scheduler and a worker locally (two terminals) to confirm a minimal working setup:
Terminal A:
dask-scheduler --port 8786
# (dashboard will be at :8787)
Terminal B:
dask-worker tcp://localhost:8786
Then run your client against tcp://localhost:8786 to verify the worker registers.
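A minimal client-side sanity check against that local setup might look like this (a sketch, assuming the scheduler and worker above are running on localhost:8786):
from distributed import Client

client = Client("tcp://localhost:8786")              # plain TCP, no TLS, for the local test
client.wait_for_workers(1, timeout=30)               # raises TimeoutError if no worker registers in time
print(client.scheduler_info()["workers"])            # should now contain one worker entry
print(client.submit(lambda x: x + 1, 41).result())   # trivial task; expect 42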
The official distributed worker documentation describes lifecycle and how workers register with the scheduler.
Network, address binding and TLS issues that cause Dask no workers
Common network/addressing/TLS causes:
- Scheduler bound to loopback (127.0.0.1) while workers run on different hosts. If the scheduler advertises only localhost, remote workers cannot connect. Start the scheduler with an explicit host or interface, e.g. dask-scheduler --host 0.0.0.0 or --interface eth0, or use a --scheduler-file so clients and workers use the same address.
- Client using tls:// while workers were started on tcp:// (or vice versa). Make sure the scheduler and all workers use the same protocol and certificate authority. Start the scheduler and workers with the same TLS options (--tls-ca-file, --tls-cert, --tls-key).
- Firewall/NAT: ports blocked between worker nodes and the scheduler (8786/tcp for scheduler RPC; 8787 for the dashboard). Test with nc -vz <scheduler-host> 8786, telnet, or a small socket program (e.g. python -c 'import socket; socket.create_connection(("host", 8786), 5)').
- DNS issues in cloud/Kubernetes setups: worker pods exist but cannot resolve the scheduler service name.
If you see workers created in your cluster manager but the scheduler's worker list stays empty, test connectivity from a worker host to the scheduler with nc/curl/telnet and inspect worker logs for connection-refused or TLS handshake errors. For Kubernetes and similar environments, look at pod logs (kubectl logs <pod>). Real-world reports of this pattern appear in the Dask Kubernetes and jobqueue issue trackers: see the dask-kubernetes issue about worker timeouts and the dask-jobqueue SLURM issue.
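If TLS is involved, one way to rule out certificate/protocol mismatches on the client side is to pass an explicit Security object. This is a sketch only; the certificate paths are placeholders you would replace with the same CA/cert/key files used to start the scheduler and workers:
from distributed import Client
from distributed.security import Security

# Placeholder paths - point these at the files your scheduler/workers actually use
sec = Security(
    tls_ca_file="certs/ca.pem",
    tls_client_cert="certs/client-cert.pem",
    tls_client_key="certs/client-key.pem",
    require_encryption=True,
)
client = Client("tls://scheduler-host:8786", security=sec)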
Process model, imports and pickling problems
If workers start and then immediately disconnect or you see odd import errors in their logs, consider how your function process_file is defined and how Python spawns worker processes:
- Don't rely on __main__-only functions: put process_file in an importable module (file) and import it from your script. Cloudpickle can serialize functions, but workers must be able to import the module name used by the function.
- On Windows, or when using the spawn start method, you must protect script entry points:
if __name__ == "__main__":
    from distributed import Client
    client = Client("tls://localhost:xxxx")
    # run submit/map code here
StackOverflow threads explain how missing if __name__ == "__main__": guards and incorrect process startup can leave workers unstarted or repeatedly re-importing the client creation code: https://stackoverflow.com/questions/55149250/dask-not-starting-workers.
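To make the first point concrete, here is a minimal layout sketch; the module name my_processing and the body of process_file are placeholders, and the module file must also be importable on the worker nodes (same path or installed into their environment):
# my_processing.py
import numpy as np

def process_file(path):
    # placeholder body - your real per-file logic goes here
    return np.array([len(path)])

# run_jobs.py
if __name__ == "__main__":
    from distributed import Client
    from my_processing import process_file   # imported, not defined in __main__

    client = Client("tls://localhost:xxxx")
    futures = client.map(process_file, ["a.root", "b.root"])
    print(client.gather(futures))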
Version and environment mismatches that make workers die
A very frequent cause: worker processes start but die silently because messages cannot be unpacked (protocol/serialization/version mismatch) or because required libraries are missing. Check:
- Versions on scheduler, client and workers:
python -c "import dask, distributed; print(dask.__version__, distributed.__version__)"
Run that on the machine running the client and on each worker node. The Why did my worker die? page lists version/environment mismatches as a common cause. Also verify critical packages used for serialization (cloudpickle, msgpack/pyarrow if relevant) match.
If versions differ, align them (pip/conda install the same versions). If workers crash with tracebacks, examine those logs — they’ll often show an ImportError, AttributeError, or deserialization error.
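You can also compare versions from the client side with Client.get_versions; depending on your distributed version, check=True raises or warns on mismatches between client, scheduler and workers:
from distributed import Client

client = Client("tls://localhost:xxxx")
versions = client.get_versions(check=True)   # collects package versions from client, scheduler and workers
print(versions["scheduler"]["packages"])     # dask, distributed, cloudpickle, msgpack, ...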
Cluster managers and autoscaling pitfalls
If you use SLURM, Kubernetes, Dask-Gateway, or dask-jobqueue:
- The cluster manager may queue worker jobs, and they may not run immediately (or resources may be exhausted). Check squeue/qstat/kubectl get pods and the job logs on those systems. The Planetary Computer discussion shows quota/pooled-node issues generating this symptom: https://github.com/microsoft/PlanetaryComputer/discussions/178.
- Adaptive/autoscaling configs can leave you with zero workers (e.g. minimum=0). Set a sensible minimum (1) so the scheduler always has at least one worker for profiling and task throughput; the Dask tutorial recommends keeping a minimum of 1 for responsive behavior: https://tutorial.dask.org/04_distributed.html. A jobqueue sketch follows this list.
- Job lifecycle options like death_timeout (used by some cluster setups) can cause workers to be killed if they can't connect back quickly. See related discussions in the jobqueue repository: https://github.com/dask/dask-jobqueue/issues/20 and https://github.com/dask/distributed/issues/2941.
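As an illustration with dask-jobqueue (the partition name and resource numbers are placeholders, not recommendations), a sketch that keeps at least one worker alive could look like:
from dask_jobqueue import SLURMCluster
from distributed import Client

# Placeholder queue/resources - adjust to your site
cluster = SLURMCluster(queue="general", cores=4, memory="8GB", walltime="01:00:00")
cluster.adapt(minimum=1, maximum=10)       # never scale below one worker
client = Client(cluster)
client.wait_for_workers(1, timeout=300)    # the job queue may take a while to start workers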
Logs, commands and practical troubleshooting steps
Step-by-step checklist you can run now:
- Inspect scheduler and worker status:
client = Client("tls://localhost:xxxx")
client.scheduler_info()
client.ncores()
- Wait for workers (if the cluster is still starting them):
client.wait_for_workers(1, timeout=30) # blocks until 1 worker registers or times out
- Check connectivity from a worker node to the scheduler:
nc -vz <scheduler-host> 8786 or telnet <scheduler-host> 8786
- Inspect worker logs for errors (on the node, via systemd, or via kubectl logs).
- Verify versions on client, scheduler and workers:
python -c "import dask, distributed; print(dask.__version__, distributed.__version__)"
- If using TLS, reproduce with non-TLS locally to isolate config issues: start a scheduler and worker locally with tcp:// to confirm the code works end-to-end.
- If using a cluster manager, ensure the minimum number of workers is greater than zero or scale the cluster manually:
cluster.scale(2) # or cluster.adapt(minimum=1, maximum=10)
- Try simple functions first (no heavy dependencies) to confirm worker registration before submitting your full process_file workload (see the sketch after this list).
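A minimal smoke test for that last step might look like this (inc is just a stand-in function with no heavy dependencies):
from distributed import Client

def inc(x):
    # trivial function - no heavy imports, nothing to go wrong except registration itself
    return x + 1

client = Client("tls://localhost:xxxx")
client.wait_for_workers(1, timeout=60)
futures = client.map(inc, range(10))
print(client.gather(futures))   # expect [1, 2, ..., 10]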
If you see errors like “No workers found” in the scheduler logs (distributed.core - ERROR - No workers found), that often means tasks were queued before workers registered — either because the cluster was still spinning up or because workers could not register. See the GitHub issue where this exact symptom appears for jobqueue/SLURM users: https://github.com/dask/distributed/issues/2941 and the scatter-specific issue https://github.com/dask/distributed/issues/2454.
Common error messages and next steps
- TimeoutError: No workers found - wait for workers (client.wait_for_workers) or scale the cluster to more than zero workers.
- Worker connection refused / TLS handshake failure - check that the protocol and certificates match on the scheduler and workers.
- Worker starts then disappears - inspect worker logs for exceptions; check version mismatches per the "Why did my worker die?" docs.
- Dashboard shows workers but client.map hangs - confirm the client is connected to the correct scheduler instance (addresses match) and that worker resources are not all busy/blocked (see the snippet after this list).
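For the last point, a quick way to double-check which scheduler the client is actually talking to (a sketch; the keys come from the identity dict returned by scheduler_info()):
from distributed import Client

client = Client("tls://localhost:xxxx")
info = client.scheduler_info()
print("scheduler address:", info["address"])     # compare with the address your workers were pointed at
print("dashboard:", client.dashboard_link)
print("registered workers:", list(info["workers"]))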
Conclusion
"Dask no workers" usually means the workers never registered, or they registered and then died, most often due to networking/TLS/address-binding issues, cluster autoscaling settings, or environment/version mismatches. Use client.scheduler_info() and client.wait_for_workers(1) to confirm registration, inspect worker logs and the dashboard, test network connectivity from worker hosts to the scheduler, and ensure scheduler/worker TLS settings and package versions match. Once you confirm one worker can register, your client.map() / client.gather() calls will start running normally.
Sources
- https://distributed.dask.org/en/latest/worker.html
- https://distributed.dask.org/en/stable/killed.html
- https://distributed.dask.org/en/latest/faq.html
- https://distributed.dask.org/en/latest/scheduling-state.html
- https://tutorial.dask.org/04_distributed.html
- https://stackoverflow.com/questions/79854826/dask-client-connects-successfully-but-no-workers-are-available
- https://stackoverflow.com/questions/55012149/why-does-my-dask-client-show-zero-workers-cores-and-memory
- https://stackoverflow.com/questions/55149250/dask-not-starting-workers
- https://stackoverflow.com/questions/44664026/local-dask-scheduler-failing-to-connect-to-workers-on-remote-resource
- https://github.com/dask/dask-jobqueue/issues/20
- https://github.com/dask/dask-jobqueue/issues/246
- https://github.com/dask/dask-kubernetes/issues/176
- https://github.com/dask/distributed/issues/2941
- https://github.com/dask/distributed/issues/2454
- https://github.com/dask/dask/issues/3912
- https://dask.discourse.group/t/worker-pods-exist-but-client-cannot-connect-to-them-or-workers-do-not-accept-jobs/2822
- https://dask.discourse.group/t/no-jobs-sent-to-workers/2527
- https://github.com/microsoft/PlanetaryComputer/discussions/178
If you want, paste the exact output of client.scheduler_info() and a short excerpt of the worker log (or kubectl logs / journalctl output) and I’ll point to the most likely error lines and the exact next command to run.