Troubleshoot Dask: Client Connects but No Workers Available

Dask client connects but reports no workers: troubleshoot the scheduler, worker registration, network/TLS, version mismatches, and autoscaling, and check the worker logs.

Dask client connects successfully but shows no workers available

I am using Dask for file processing. The client starts successfully and connects to tls://localhost:xxxx, but zero workers are available, so tasks like client.map() and client.gather() do not run.

Client creation:

python
client = Client("tls://localhost:xxxx")

Processing code:

python
import time

import numpy as np

# process_file and root_files are defined earlier in the script
start = time.perf_counter()

print(f"Submitting {len(root_files)} files to the cluster...")

futures = client.map(process_file, root_files)
results = client.gather(futures)

all_masses = np.concatenate(results)

elapsed = time.perf_counter() - start

print(f"Total events processed: {len(all_masses)}")
print(f"Processing time: {elapsed:.2f} s")

What could be causing no workers to register or become available in the Dask cluster? How can I troubleshoot and fix this issue?

Most commonly this happens because workers never register with the scheduler — either they haven’t started yet, can’t reach the scheduler (network/TLS/addressing), are started but immediately die (version or environment mismatch, exceptions), or your cluster manager has scaled to zero. Start by checking scheduler reachability, verifying worker processes and logs, and using client.wait_for_workers(1) or client.scheduler_info() to see whether any workers are actually registered.

Quick checklist: Dask no workers

  • Does client.scheduler_info()['workers'] show any entries?
  • Is the scheduler process running and reachable (port and dashboard)?
  • Are worker processes running on the expected hosts/nodes? Inspect their logs.
  • Are TLS/protocol settings consistent (e.g. the client uses tls:// but the workers were started without TLS)?
  • Are you using a cluster manager (SLURM, Kubernetes, Dask-Gateway) that may be queuing/limiting workers or scaling to zero?

If you haven’t checked any of the above, stop and do those first — you’ll resolve most “no workers” cases quickly.
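
A minimal way to turn the first checklist item into code (a sketch assuming the same tls://localhost:xxxx address used in the question): wait briefly for at least one worker, then print whatever the scheduler has registered.

python
from distributed import Client

client = Client("tls://localhost:xxxx")   # same address as in the question

try:
    client.wait_for_workers(1, timeout=30)    # block until one worker registers
except Exception as exc:                      # exact timeout exception type varies by version
    print("no worker registered within 30 s:", exc)

workers = client.scheduler_info()["workers"]
print(f"{len(workers)} worker(s) registered")
for addr, info in workers.items():
    print(addr, info.get("nthreads"), "threads")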

Confirm scheduler and worker processes

Verify the scheduler is running and listening on the address you think it is:

  • Check scheduler process: ps aux | grep dask-scheduler (or systemd/journalctl if started as a service).
  • Confirm the dashboard is reachable (default web UI on port 8787): open http://<scheduler-host>:8787 or inspect client.dashboard_link.
  • From your Python client run quick introspection:
python
from distributed import Client
client = Client("tls://localhost:xxxx")
print("dashboard:", client.dashboard_link)
print("scheduler_info:", client.scheduler_info())   # dict with 'workers'
print("ncores:", client.ncores())                  # mapping of worker → cores

If client.scheduler_info()['workers'] is {} then the scheduler does not see any registered workers. The Dask scheduler state docs explain how scheduler state reflects registration/queued tasks.

If the workers should be running on the same host, run ps aux | grep dask-worker or check the worker CLI terminal. If workers are on remote hosts, ssh into a node and inspect dask-worker processes and their logs.

Useful quick test: start a scheduler and a worker locally (two terminals) to confirm a minimal working setup:

Terminal A:

bash
dask-scheduler --port 8786
# (dashboard will be at :8787)

Terminal B:

bash
dask-worker tcp://localhost:8786

Then run your client against tcp://localhost:8786 to verify the worker registers.
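
To confirm registration from the Python side against that local tcp:// scheduler, a short client-side check (same minimal setup, no TLS) might look like:

python
from distributed import Client

# Connect to the scheduler started in Terminal A (plain tcp://, no TLS)
client = Client("tcp://localhost:8786")
client.wait_for_workers(1, timeout=30)   # returns once the Terminal B worker registers
print("registered workers:", list(client.scheduler_info()["workers"]))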

The official distributed worker documentation describes lifecycle and how workers register with the scheduler.

Network, address binding and TLS issues that cause Dask no workers

Common network/addressing/TLS causes:

  • Scheduler bound to loopback (127.0.0.1) while workers run on different hosts. If the scheduler advertises only localhost, remote workers cannot connect. Start the scheduler so it advertises a reachable address, e.g. dask-scheduler --host 0.0.0.0 or --interface eth0, or use --scheduler-file so clients and workers read the same address from a shared file.
  • Client using tls:// while workers were started on tcp:// (or vice versa). Make sure scheduler and all workers use the same protocol and certificate authority. Start scheduler/workers with the same TLS options (--tls-ca-file, --tls-cert, --tls-key).
  • Firewall/NAT: ports blocked between worker nodes and scheduler (8786/tcp for scheduler RPC; 8787 for dashboard). Test with nc -vz <scheduler-host> 8786, telnet, or a small socket program (e.g. python -c 'import socket; socket.create_connection(("host",8786),5)').
  • DNS issues for cloud/kubernetes setups: worker pods exist but cannot resolve the scheduler service name.

If you see workers created in your cluster manager but the scheduler report stays empty, test connectivity from a worker host to the scheduler with nc/curl/telnet and inspect worker logs for connection refused or TLS handshake errors. For Kubernetes and similar environments look at pod logs (kubectl logs <pod>). Real-world reports of this pattern appear in the Dask Kubernetes and jobqueue issue trackers: see the dask-kubernetes issue about worker timeouts and the dask-jobqueue SLURM issue.
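
If the protocol and certificates are the suspect, the client must be given the same TLS material as the scheduler and workers. A sketch using distributed.security.Security; the file paths below are placeholders for your own CA, certificate and key:

python
from distributed import Client
from distributed.security import Security

# Placeholder paths: point these at the same CA/cert/key material the
# scheduler and workers were started with (--tls-ca-file, --tls-cert, --tls-key)
security = Security(
    tls_ca_file="ca.pem",
    tls_client_cert="client-cert.pem",
    tls_client_key="client-key.pem",
    require_encryption=True,
)

client = Client("tls://localhost:xxxx", security=security)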

Process model, imports and pickling problems

If workers start and then immediately disconnect or you see odd import errors in their logs, consider how your function process_file is defined and how Python spawns worker processes:

  • Don’t rely on __main__-only functions: put process_file in an importable module (file) and import it from your script. Cloudpickle can serialize functions, but workers must be able to import the module name used by the function.
  • On Windows or when using the spawn start method you must protect script entry points:
python
if __name__ == "__main__":
    from distributed import Client
    client = Client("tls://localhost:xxxx")
    # run submit/map code here

StackOverflow threads explain how missing if __name__ == "__main__": guards and incorrect process startup can leave workers unstarted or repeatedly re-importing the client creation code: https://stackoverflow.com/questions/55149250/dask-not-starting-workers.
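
A sketch of that layout, with hypothetical file names (processing.py holding the worker-side function, run_analysis.py as the entry point):

python
# processing.py -- hypothetical module; it must be importable on the workers as well
import numpy as np

def process_file(path):
    """Open one input file and return a NumPy array of results."""
    # ... the real file-processing logic goes here ...
    return np.asarray([])

python
# run_analysis.py -- hypothetical entry-point script run on the client machine
from distributed import Client
from processing import process_file   # imported from a module, not defined under __main__

if __name__ == "__main__":
    client = Client("tls://localhost:xxxx")
    futures = client.map(process_file, ["a.root", "b.root"])   # placeholder file list
    print(len(client.gather(futures)))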

Version and environment mismatches that make workers die

A very frequent cause: worker processes start but die silently because messages cannot be unpacked (protocol/serialization/version mismatch) or because required libraries are missing. Check:

  • Versions on scheduler, client and workers:
bash
python -c "import dask, distributed; print(dask.__version__, distributed.__version__)"

Run that on the machine running the client and on each worker node. The “Why did my worker die?” page in the Dask documentation lists version/environment mismatches as a common cause. Also verify that the critical serialization packages (cloudpickle, and msgpack/pyarrow if relevant) match.

If versions differ, align them (pip/conda install the same versions). If workers crash with tracebacks, examine those logs — they’ll often show an ImportError, AttributeError, or deserialization error.
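
Once at least one worker has registered, you can also let Dask compare environments for you: Client.get_versions collects package versions from the client, scheduler and workers, and with check=True it raises if required packages disagree.

python
from distributed import Client

client = Client("tls://localhost:xxxx")
# check=True raises if required packages differ between client, scheduler and workers;
# the exact structure of the returned dict can vary slightly between releases.
versions = client.get_versions(check=True)
print(sorted(versions))   # typically: 'client', 'scheduler', 'workers'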

Cluster managers and autoscaling pitfalls

If you use SLURM, Kubernetes, Dask-Gateway, or dask-jobqueue:

  • Adaptive or manual scaling may have left the cluster at zero workers: check that the minimum number of workers is greater than zero, or call cluster.scale(...) explicitly.
  • Worker jobs/pods may still be queued or pending: check squeue (SLURM), kubectl get pods (Kubernetes), or the gateway dashboard before assuming something is broken.
  • Worker jobs can be killed by the resource manager (walltime or memory limits) before they ever register: inspect the batch job or pod logs, not just the scheduler log.
  • Make sure the worker jobs are given the correct scheduler address (for example via --scheduler-file or the address the cluster manager injects).
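
For example, with dask-jobqueue on SLURM (resource values below are illustrative), workers only appear after an explicit scale/adapt call and after the queued batch jobs actually start:

python
from dask_jobqueue import SLURMCluster
from distributed import Client

# Illustrative resources: adjust to your site. Jobs wait in the SLURM queue
# until granted, so workers can take a while to register with the scheduler.
cluster = SLURMCluster(cores=4, memory="8GB", walltime="01:00:00")
cluster.scale(jobs=2)              # or: cluster.adapt(minimum=1, maximum=10)

client = Client(cluster)
client.wait_for_workers(1, timeout=600)   # generous timeout to allow for queue wait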

Logs, commands and practical troubleshooting steps

Step-by-step checklist you can run now:

  1. Inspect scheduler and worker status:
    • client = Client("tls://localhost:xxxx")
    • client.scheduler_info() and client.ncores()
  2. Wait for workers (if the cluster is starting them):
python
client.wait_for_workers(1, timeout=30)   # blocks until 1 worker registers or times out
  3. Check connectivity from a worker node to the scheduler:
    • nc -vz <scheduler-host> 8786 or telnet <scheduler-host> 8786
  4. Inspect worker logs for errors (on the node, via systemd/journalctl, or via kubectl logs).
  5. Verify versions on client, scheduler and workers:
bash
python -c "import dask, distributed; print(dask.__version__, distributed.__version__)"
  6. If using TLS, reproduce without TLS locally to isolate config issues: start a scheduler and worker locally with tcp:// to confirm the code works end-to-end.
  7. If using a cluster manager, ensure the minimum number of workers is greater than zero, or scale the cluster manually:
python
cluster.scale(2)   # or cluster.adapt(minimum=1, maximum=10)
  8. Try simple functions first (no heavy dependencies) to confirm worker registration before submitting your full process_file workload (see the sketch below).
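
A minimal end-to-end sanity check for that last step, using a trivial function with no heavy dependencies:

python
import os
from distributed import Client

client = Client("tls://localhost:xxxx")
client.wait_for_workers(1, timeout=60)

def where_am_i(x):
    # trivial task: return the input together with the worker's process id
    return x, os.getpid()

futures = client.map(where_am_i, range(4))
print(client.gather(futures))   # if this returns, registration and task round trips work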

If you see errors like “No workers found” in the scheduler logs (distributed.core - ERROR - No workers found), that often means tasks were queued before workers registered — either because the cluster was still spinning up or because workers could not register. See the GitHub issue where this exact symptom appears for jobqueue/SLURM users: https://github.com/dask/distributed/issues/2941 and the scatter-specific issue https://github.com/dask/distributed/issues/2454.

Common error messages and next steps

  • TimeoutError: No workers found — wait for workers (client.wait_for_workers) or scale cluster to >0.
  • Worker connection refused / TLS handshake failure — check protocol and certs on scheduler and workers.
  • Worker starts then disappears — inspect worker logs for exceptions; check for version mismatches per the “Why did my worker die?” docs.
  • Dashboard shows workers but client.map hangs — confirm the client is connected to the correct scheduler instance (addresses match) and that worker resources are not all busy/blocked.
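
If you cannot easily log in to the scheduler or worker hosts, you can often pull recent log lines through the client itself (Client.get_scheduler_logs and Client.get_worker_logs exist in current distributed releases; older versions may differ):

python
from distributed import Client

client = Client("tls://localhost:xxxx")

# Recent scheduler log lines: look for "No workers found" or TLS handshake errors
for level, message in client.get_scheduler_logs(n=50):
    print(level, message)

# Worker logs keyed by worker address (an empty dict if no worker ever registered)
print(client.get_worker_logs(n=50))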

Conclusion

A Dask cluster with no workers usually means the workers never registered, or they registered and then died: most often due to networking/TLS/address-binding issues, cluster autoscaling settings, or environment/version mismatches. Use client.scheduler_info() and client.wait_for_workers(1) to confirm registration, inspect the worker logs and the dashboard, test network connectivity from worker hosts to the scheduler, and make sure TLS settings and package versions match across scheduler, workers and client. Once you confirm that one worker can register, your client.map() / client.gather() calls will start running normally.

If you want, paste the exact output of client.scheduler_info() and a short excerpt of the worker log (or kubectl logs / journalctl output) and I’ll point to the most likely error lines and the exact next command to run.
