Troubleshoot Dask: Client Connects but No Workers Available

Dask client connects but reports no workers: troubleshoot the scheduler, worker registration, network/TLS, version mismatches, and autoscaling, and check the worker logs.

Dask client connects successfully but shows no workers available

I am using Dask for file processing. The client starts successfully and connects to tls://localhost:xxxx, but zero workers are available, so tasks like client.map() and client.gather() do not run.

Client creation:

python
client = Client("tls://localhost:xxxx")

Processing code:

python
import time

import numpy as np

# process_file and root_files are defined earlier in the script
start = time.perf_counter()

print(f"Submitting {len(root_files)} files to the cluster...")

futures = client.map(process_file, root_files)
results = client.gather(futures)

all_masses = np.concatenate(results)

elapsed = time.perf_counter() - start

print(f"Total events processed: {len(all_masses)}")
print(f"Processing time: {elapsed:.2f} s")

What could be causing no workers to register or become available in the Dask cluster? How can I troubleshoot and fix this issue?

Most commonly this happens because workers never register with the scheduler — either they haven’t started yet, can’t reach the scheduler (network/TLS/addressing), are started but immediately die (version or environment mismatch, exceptions), or your cluster manager has scaled to zero. Start by checking scheduler reachability, verifying worker processes and logs, and using client.wait_for_workers(1) or client.scheduler_info() to see whether any workers are actually registered.

Quick checklist: Dask no workers

  • Does client.scheduler_info()['workers'] show any entries?
  • Is the scheduler process running and reachable (port and dashboard)?
  • Are worker processes running on the expected hosts/nodes? Inspect their logs.
  • Are TLS/protocol settings consistent (e.g. the client uses tls:// but the workers were started without TLS)?
  • Are you using a cluster manager (SLURM, Kubernetes, Dask-Gateway) that may be queuing/limiting workers or scaling to zero?

If you haven’t checked any of the above, stop and do those first — you’ll resolve most “no workers” cases quickly.
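
A minimal way to turn the first checklist item into code (a sketch assuming the same tls://localhost:xxxx address used in the question): wait briefly for at least one worker, then print whatever the scheduler has registered.

python
from distributed import Client

client = Client("tls://localhost:xxxx")   # same address as in the question

try:
    client.wait_for_workers(1, timeout=30)    # block until one worker registers
except Exception as exc:                      # exact timeout exception type varies by version
    print("no worker registered within 30 s:", exc)

workers = client.scheduler_info()["workers"]
print(f"{len(workers)} worker(s) registered")
for addr, info in workers.items():
    print(addr, info.get("nthreads"), "threads")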

Confirm scheduler and worker processes

Verify the scheduler is running and listening on the address you think it is:

  • Check scheduler process: ps aux | grep dask-scheduler (or systemd/journalctl if started as a service).
  • Confirm the dashboard is reachable (default web UI on port 8787): open http://<scheduler-host>:8787 or inspect client.dashboard_link.
  • From your Python client run quick introspection:
python
from distributed import Client
client = Client("tls://localhost:xxxx")
print("dashboard:", client.dashboard_link)
print("scheduler_info:", client.scheduler_info())   # dict with 'workers'
print("ncores:", client.ncores())                  # mapping of worker → cores

If client.scheduler_info()['workers'] is {} then the scheduler does not see any registered workers. The Dask scheduler state docs explain how scheduler state reflects registration/queued tasks.

If the workers should be running on the same host, run ps aux | grep dask-worker or check the worker CLI terminal. If workers are on remote hosts, ssh into a node and inspect dask-worker processes and their logs.

Useful quick test: start a scheduler and a worker locally (two terminals) to confirm a minimal working setup:

Terminal A:

bash
dask-scheduler --port 8786
# (dashboard will be at :8787)

Terminal B:

bash
dask-worker tcp://localhost:8786

Then run your client against tcp://localhost:8786 to verify the worker registers.
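
To confirm registration from the Python side against that local tcp:// scheduler, a short client-side check (same minimal setup, no TLS) might look like:

python
from distributed import Client

# Connect to the scheduler started in Terminal A (plain tcp://, no TLS)
client = Client("tcp://localhost:8786")
client.wait_for_workers(1, timeout=30)   # returns once the Terminal B worker registers
print("registered workers:", list(client.scheduler_info()["workers"]))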

The official distributed worker documentation describes lifecycle and how workers register with the scheduler.

Network, address binding and TLS issues that cause Dask no workers

Common network/addressing/TLS causes:

  • Scheduler bound to loopback (127.0.0.1) while workers run on different hosts. If the scheduler advertises only localhost, remote workers cannot connect. Start the scheduler so it advertises a reachable address, e.g. dask-scheduler --host 0.0.0.0 or --interface eth0, or use --scheduler-file so clients and workers read the same address from a shared file.
  • Client using tls:// while workers were started on tcp:// (or vice versa). Make sure scheduler and all workers use the same protocol and certificate authority. Start scheduler/workers with the same TLS options (--tls-ca-file, --tls-cert, --tls-key).
  • Firewall/NAT: ports blocked between worker nodes and scheduler (8786/tcp for scheduler RPC; 8787 for dashboard). Test with nc -vz <scheduler-host> 8786, telnet, or a small socket program (e.g. python -c 'import socket; socket.create_connection(("host",8786),5)').
  • DNS issues for cloud/kubernetes setups: worker pods exist but cannot resolve the scheduler service name.

If you see workers created in your cluster manager but the scheduler report stays empty, test connectivity from a worker host to the scheduler with nc/curl/telnet and inspect worker logs for connection refused or TLS handshake errors. For Kubernetes and similar environments look at pod logs (kubectl logs <pod>). Real-world reports of this pattern appear in the Dask Kubernetes and jobqueue issue trackers: see the dask-kubernetes issue about worker timeouts and the dask-jobqueue SLURM issue.
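
If the protocol and certificates are the suspect, the client must be given the same TLS material as the scheduler and workers. A sketch using distributed.security.Security; the file paths below are placeholders for your own CA, certificate and key:

python
from distributed import Client
from distributed.security import Security

# Placeholder paths: point these at the same CA/cert/key material the
# scheduler and workers were started with (--tls-ca-file, --tls-cert, --tls-key)
security = Security(
    tls_ca_file="ca.pem",
    tls_client_cert="client-cert.pem",
    tls_client_key="client-key.pem",
    require_encryption=True,
)

client = Client("tls://localhost:xxxx", security=security)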

Process model, imports and pickling problems

If workers start and then immediately disconnect or you see odd import errors in their logs, consider how your function process_file is defined and how Python spawns worker processes:

  • Don’t rely on __main__-only functions: put process_file in an importable module (file) and import it from your script. Cloudpickle can serialize functions, but workers must be able to import the module name used by the function.
  • On Windows or when using the spawn start method you must protect script entry points:
python
if __name__ == "__main__":
    from distributed import Client
    client = Client("tls://localhost:xxxx")
    # run submit/map code here

StackOverflow threads explain how missing if __name__ == "__main__": guards and incorrect process startup can leave workers unstarted or repeatedly re-importing the client creation code: https://stackoverflow.com/questions/55149250/dask-not-starting-workers.
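
A sketch of that layout, with hypothetical file names (processing.py holding the worker-side function, run_analysis.py as the entry point):

python
# processing.py -- hypothetical module; it must be importable on the workers as well
import numpy as np

def process_file(path):
    """Open one input file and return a NumPy array of results."""
    # ... the real file-processing logic goes here ...
    return np.asarray([])

python
# run_analysis.py -- hypothetical entry-point script run on the client machine
from distributed import Client
from processing import process_file   # imported from a module, not defined under __main__

if __name__ == "__main__":
    client = Client("tls://localhost:xxxx")
    futures = client.map(process_file, ["a.root", "b.root"])   # placeholder file list
    print(len(client.gather(futures)))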

Version and environment mismatches that make workers die

A very frequent cause: worker processes start but die silently because messages cannot be unpacked (protocol/serialization/version mismatch) or because required libraries are missing. Check:

  • Versions on scheduler, client and workers:
bash
python -c "import dask, distributed; print(dask.__version__, distributed.__version__)"

Run that on the machine running the client and on each worker node. The “Why did my worker die?” page in the Dask documentation lists version/environment mismatches as a common cause. Also verify that the critical serialization packages (cloudpickle, and msgpack/pyarrow if relevant) match.

If versions differ, align them (pip/conda install the same versions). If workers crash with tracebacks, examine those logs — they’ll often show an ImportError, AttributeError, or deserialization error.
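
Once at least one worker has registered, you can also let Dask compare environments for you: Client.get_versions collects package versions from the client, scheduler and workers, and with check=True it raises if required packages disagree.

python
from distributed import Client

client = Client("tls://localhost:xxxx")
# check=True raises if required packages differ between client, scheduler and workers;
# the exact structure of the returned dict can vary slightly between releases.
versions = client.get_versions(check=True)
print(sorted(versions))   # typically: 'client', 'scheduler', 'workers'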

Cluster managers and autoscaling pitfalls

If you use SLURM, Kubernetes, Dask-Gateway, or dask-jobqueue:

  • Adaptive or manual scaling may have left the cluster at zero workers: check that the minimum number of workers is greater than zero, or call cluster.scale(...) explicitly.
  • Worker jobs/pods may still be queued or pending: check squeue (SLURM), kubectl get pods (Kubernetes), or the gateway dashboard before assuming something is broken.
  • Worker jobs can be killed by the resource manager (walltime or memory limits) before they ever register: inspect the batch job or pod logs, not just the scheduler log.
  • Make sure the worker jobs are given the correct scheduler address (for example via --scheduler-file or the address the cluster manager injects).
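
For example, with dask-jobqueue on SLURM (resource values below are illustrative), workers only appear after an explicit scale/adapt call and after the queued batch jobs actually start:

python
from dask_jobqueue import SLURMCluster
from distributed import Client

# Illustrative resources: adjust to your site. Jobs wait in the SLURM queue
# until granted, so workers can take a while to register with the scheduler.
cluster = SLURMCluster(cores=4, memory="8GB", walltime="01:00:00")
cluster.scale(jobs=2)              # or: cluster.adapt(minimum=1, maximum=10)

client = Client(cluster)
client.wait_for_workers(1, timeout=600)   # generous timeout to allow for queue wait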

Logs, commands and practical troubleshooting steps

Step-by-step checklist you can run now:

  1. Inspect scheduler and worker status:
    • client = Client("tls://localhost:xxxx")
    • client.scheduler_info() and client.ncores()
  2. Wait for workers (if the cluster is starting them):
python
client.wait_for_workers(1, timeout=30)   # blocks until 1 worker registers or times out
  3. Check connectivity from a worker node to the scheduler:
    • nc -vz <scheduler-host> 8786 or telnet <scheduler-host> 8786
  4. Inspect worker logs for errors (on the node, via systemd/journalctl, or via kubectl logs).
  5. Verify versions on client, scheduler and workers:
bash
python -c "import dask, distributed; print(dask.__version__, distributed.__version__)"
  6. If using TLS, reproduce without TLS locally to isolate config issues: start a scheduler and worker locally with tcp:// to confirm the code works end-to-end.
  7. If using a cluster manager, ensure the minimum number of workers is greater than zero, or scale the cluster manually:
python
cluster.scale(2)   # or cluster.adapt(minimum=1, maximum=10)
  8. Try simple functions first (no heavy dependencies) to confirm worker registration before submitting your full process_file workload (see the sketch below).
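
A minimal end-to-end sanity check for that last step, using a trivial function with no heavy dependencies:

python
import os
from distributed import Client

client = Client("tls://localhost:xxxx")
client.wait_for_workers(1, timeout=60)

def where_am_i(x):
    # trivial task: return the input together with the worker's process id
    return x, os.getpid()

futures = client.map(where_am_i, range(4))
print(client.gather(futures))   # if this returns, registration and task round trips work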

If you see errors like “No workers found” in the scheduler logs (distributed.core - ERROR - No workers found), that often means tasks were queued before workers registered — either because the cluster was still spinning up or because workers could not register. See the GitHub issue where this exact symptom appears for jobqueue/SLURM users: https://github.com/dask/distributed/issues/2941 and the scatter-specific issue https://github.com/dask/distributed/issues/2454.

Common error messages and next steps

  • TimeoutError: No workers found — wait for workers (client.wait_for_workers) or scale cluster to >0.
  • Worker connection refused / TLS handshake failure — check protocol and certs on scheduler and workers.
  • Worker starts then disappears — inspect worker logs for exceptions; check for version mismatches per the “Why did my worker die?” docs.
  • Dashboard shows workers but client.map hangs — confirm the client is connected to the correct scheduler instance (addresses match) and that worker resources are not all busy/blocked.
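
If you cannot easily log in to the scheduler or worker hosts, you can often pull recent log lines through the client itself (Client.get_scheduler_logs and Client.get_worker_logs exist in current distributed releases; older versions may differ):

python
from distributed import Client

client = Client("tls://localhost:xxxx")

# Recent scheduler log lines: look for "No workers found" or TLS handshake errors
for level, message in client.get_scheduler_logs(n=50):
    print(level, message)

# Worker logs keyed by worker address (an empty dict if no worker ever registered)
print(client.get_worker_logs(n=50))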

Conclusion

A Dask cluster with no workers usually means the workers never registered, or they registered and then died: most often due to networking/TLS/address-binding issues, cluster autoscaling settings, or environment/version mismatches. Use client.scheduler_info() and client.wait_for_workers(1) to confirm registration, inspect the worker logs and the dashboard, test network connectivity from worker hosts to the scheduler, and make sure TLS settings and package versions match across scheduler, workers and client. Once you confirm that one worker can register, your client.map() / client.gather() calls will start running normally.

If you want, paste the exact output of client.scheduler_info() and a short excerpt of the worker log (or kubectl logs / journalctl output) and I’ll point to the most likely error lines and the exact next command to run.
