Create Custom Hydra Launcher Plugin for Task-Spooler
Build a custom Hydra (hydra.cc) launcher plugin to submit multirun jobs to task-spooler and return results with exit codes. Covers the required APIs, a minimal code example, and plugin discovery/registration, with tips for PyTorch Lightning workflows.
How do I create a custom Hydra (hydra.cc) Launcher plugin to submit multirun jobs to task-spooler and return results to Hydra?
Background:
- I use a fork of task-spooler that can track GPU usage.
- My projects use PyTorch Lightning and Hydra to configure experiments.
- Hydra’s multirun supports custom launchers; I want to enqueue runs into task-spooler.
Problems:
- I can’t find example launcher plugins, templates, or developer documentation for writing a launcher plugin.
- Existing launcher plugins in the hydra-core repository look complex and it’s unclear how they return values/results to the parent process.
Specific questions:
- Are there minimal example launcher plugins, templates, or tutorials for creating a custom Hydra launcher?
- Which hydra-core APIs, classes, or lifecycle hooks must I implement, and how do I register the plugin?
- How should a custom launcher return job results/status to Hydra multirun (e.g., return objects, exit codes, callbacks)?
- Can you provide a minimal, runnable example that enqueues a job in task-spooler and returns its result to Hydra?
Any pointers, code snippets, or step-by-step guidance to get started would be appreciated.
You can implement a custom Hydra launcher by subclassing the Launcher plugin base class (hydra.plugins.launcher.Launcher), implementing launch(...) to enqueue each multirun job into task-spooler (tsp), and shipping the class inside the hydra_plugins namespace package (plus a small ConfigStore registration) so Hydra discovers it. For a minimal, reliable start: enqueue the command with tsp, have the job write its exit code to a known result file, poll tsp (or the result file) until completion, then return the per-job result back to Hydra — the submitit and ray plugins follow the same overall pattern of returning results or handles to Hydra.
Contents
- Hydra launcher overview and required APIs
- Which hydra-core APIs and lifecycle hooks to implement
- How to return job results/status to Hydra multirun
- Minimal, runnable Task-Spooler (tsp) Launcher example
- Packaging & registration (hydra_plugins namespace)
- Debugging, testing and tips for PyTorch Lightning workflows
- Advanced: asynchronous patterns and scaling
- Sources
- Conclusion
Hydra launcher overview and required APIs
Hydra’s launcher plugins are the piece that actually starts jobs generated by a multirun. In practice a launcher:
- receives a batch of “jobs” (each job is a set of Hydra override arguments),
- composes a command line for each job,
- launches the command in the target environment, and
- returns a per-job handle/result (IDs, futures, exit codes, etc.).
The official plugin-development doc explains the lifecycle and the basic contract: implement the Launcher base class and its launch method. See the Hydra plugin development docs for the canonical guidance: https://hydra.cc/docs/advanced/plugins/develop/ and the overview page for launchers: https://hydra.cc/docs/advanced/plugins/overview/.
Why subclass Hydra’s Launcher? Because it standardizes how Hydra hands you the multirun jobs and how Hydra expects to receive whatever you return (IDs, refs, or results). The Submitit and Ray plugins are useful references — they show two common patterns (enqueue-and-return-ID vs. return future-like refs): https://hydra.cc/docs/plugins/submitit_launcher/ and https://hydra.cc/docs/plugins/ray_launcher/.
Which hydra-core APIs and lifecycle hooks must I implement, and how do I register the plugin?
What to implement
- Subclass: implement Hydra's Launcher base class, which lives at hydra.plugins.launcher.Launcher (the docs simply call it the Launcher base class).
- Lifecycle hooks: the base class declares two abstract methods. setup(...) is called once before launching and hands you the composed config, the task function, and (in Hydra 1.1+) a HydraContext; launch(...) is called with the batch of jobs.
- Primary method: launch(self, job_overrides, initial_job_idx) — this is where you build the per-job command and submit it to tsp. The returned list should align with the input job order; Hydra's default sweeper expects one JobReturn object per job, so returning bare job IDs or exit-code strings only works for ad-hoc experiments. The plugin development docs describe the contract: https://hydra.cc/docs/advanced/plugins/develop/.
What “jobs” contains
- Hydra passes each job as a list of override strings (e.g. ["dataset=mnist", "lr=0.001"]). Your launcher composes the actual executable invocation from those overrides (a common pattern is to re-invoke the app with the overrides). A sketch of the interface, and of what job_overrides looks like for a concrete sweep, follows below.
Plugin registration (discovery)
- Hydra does not use setuptools entry points to find plugins; it discovers them by scanning the hydra_plugins namespace package. Put your launcher module inside a top-level hydra_plugins package, e.g. hydra_plugins/hydra_tsp_launcher/tsp_launcher.py, and leave hydra_plugins/ itself without an __init__.py so it stays a namespace package.
- In addition, register a config node for the launcher in Hydra's ConfigStore under the hydra/launcher group. The name you register (e.g. task_spooler) is what users select at run time, and its _target_ points at your launcher class (see the config.py sketch below).
Run-time selection
- After installation you can choose the launcher at run time, for example:
- CLI override: python train.py --multirun … hydra/launcher=task_spooler
- Or in your primary config's defaults list: - override hydra/launcher: task_spooler
See “Configuring Plugins” for examples and options: https://hydra.cc/docs/patterns/configuring_plugins/.
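Selecting hydra/launcher=task_spooler works once a config node with that name exists in the hydra/launcher group. A minimal config.py sketch, assuming the module layout used throughout this article (hydra_plugins/hydra_tsp_launcher/) and a single poll_interval parameter:
# hydra_plugins/hydra_tsp_launcher/config.py
# Registers "task_spooler" in the hydra/launcher group; _target_ points at the launcher class.
from dataclasses import dataclass

from hydra.core.config_store import ConfigStore


@dataclass
class TaskSpoolerLauncherConf:
    _target_: str = "hydra_plugins.hydra_tsp_launcher.tsp_launcher.TaskSpoolerLauncher"
    poll_interval: float = 1.0


ConfigStore.instance().store(
    group="hydra/launcher",
    name="task_spooler",
    node=TaskSpoolerLauncherConf,
    provider="hydra_tsp_launcher",
)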
How should a custom launcher return job results/status to Hydra multirun?
There are two pragmatic patterns you’ll see in the official plugins:
- Async pattern (enqueue + return handle)
- What it does: launch() enqueues work and returns a handle (job ID string, submitit Job object, Ray object ref, etc.). Hydra records the handle in its job table. Plugins that return handles leave it to the plugin or user code to map handles to final outputs.
- Pros: non-blocking, scalable.
- Cons: Hydra core doesn’t magically know how to turn an arbitrary handle into an exit code — the plugin (or an external monitor) must provide that mapping if you want Hydra to surface final statuses.
- Sync (blocking) pattern — simplest to get working
- What it does: launch() enqueues the job, then blocks (polls) until the job completes, reads the exit code/stdout/stderr (or a small result file the job writes), and returns that final result to Hydra.
- Pros: simple; Hydra immediately sees the final status and result for each job.
- Cons: not fully asynchronous; the parent process waits until the queued job runs and finishes.
How official plugins do it
- Submitit returns submitit job objects / ids; Hydra and the plugin work together to populate the job table and retrieve exit codes when needed: https://hydra.cc/docs/plugins/submitit_launcher/.
- Ray launcher returns object refs (futures) and calls ray.get() to collect actual results: https://hydra.cc/docs/plugins/ray_launcher/.
Practical recommendation for your Task-Spooler launcher
- Start with a blocking (sync) implementation: enqueue with tsp, have the enqueued command write its exit code to a known file (passed as an environment variable or encoded into the command), poll until the job finishes, then read that file and return its contents to Hydra (the core contract is illustrated right after this list). That pattern is simple, robust, and portable across different tsp forks.
- When you want async behavior, move to a return-of-ID pattern and implement a small monitor/agent (or let your CI/experiment manager query tsp and hydrate Hydra’s job table externally).
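Stripped of the launcher machinery, the recommended enqueue-and-write-exit-code contract is just this (the module name and result-file path are arbitrary examples, not part of any API):
# Enqueue one run in task-spooler; the job records its own exit code in a file
# that the launcher (or any other process) can poll for and read later.
import subprocess

script = "python -m my_project.train lr=0.001; echo $? > /tmp/job42.exitcode"
subprocess.run(["tsp", "bash", "-lc", script], check=True)  # returns once the job is queued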
Minimal, runnable Task-Spooler (tsp) Launcher example
Quick notes before the code:
- This example uses a conservative, easy-to-understand approach: enqueue a wrapped shell command into tsp, instruct the job to write its exit code to a temporary result file, poll until completion, then hand the exit code back to Hydra wrapped in a JobReturn so the default sweeper can consume it. That makes the plugin work out of the box and lets Hydra see per-job results.
- You’ll need your app module name (or a stable way to re-invoke the script). The example reads an env var HYDRA_TSP_APP_MODULE — set it to your app’s module (e.g. my_project.train) or adapt the command composition to your environment.
- Adjust paths and polling intervals to suit your setup. This code is intentionally small so you can iterate.
tsp_launcher.py (minimal)
# hydra_plugins/hydra_tsp_launcher/tsp_launcher.py
import os
import re
import shlex
import subprocess
import sys
import time
import uuid
from typing import List, Sequence

# JobReturn/JobStatus and the Launcher base class live at these paths in Hydra 1.1+.
from hydra.core.utils import JobReturn, JobStatus
from hydra.plugins.launcher import Launcher


class TaskSpoolerLauncher(Launcher):
    """
    Minimal Task-Spooler launcher:
    - Receives each job as a sequence of override strings (e.g. ["dataset=mnist", "lr=0.001"])
    - Re-invokes Python with the module from HYDRA_TSP_APP_MODULE or falls back to sys.argv[0]
    - Enqueues a wrapped command to tsp which writes its exit code to a result file
    - Blocks until each job finishes and returns one JobReturn per job
    """

    def __init__(self, poll_interval: float = 1.0, **kwargs):
        super().__init__()
        self.poll_interval = float(poll_interval)

    def setup(self, *, hydra_context, task_function, config) -> None:
        # Hydra calls setup() once before launch(); this simple launcher only stores
        # the arguments, but the method is required by the Launcher ABC.
        self.hydra_context = hydra_context
        self.task_function = task_function
        self.config = config

    def _compose_cmd(self, overrides: Sequence[str]) -> str:
        # How to call your app: either set env var HYDRA_TSP_APP_MODULE or rely on the script path
        app_module = os.environ.get("HYDRA_TSP_APP_MODULE")
        if app_module:
            base = f"{shlex.quote(sys.executable)} -m {shlex.quote(app_module)}"
        else:
            # fallback: re-run the same script that started Hydra
            script = sys.argv[0]
            base = f"{shlex.quote(sys.executable)} {shlex.quote(script)}"
        # overrides are passed as CLI args; make sure they're shell-quoted
        args = " ".join(shlex.quote(str(o)) for o in overrides)
        return f"{base} {args}".strip()

    def launch(
        self, job_overrides: Sequence[Sequence[str]], initial_job_idx: int = 0
    ) -> List[JobReturn]:
        results: List[JobReturn] = []
        for overrides in job_overrides:
            cmd_core = self._compose_cmd(overrides)
            token = uuid.uuid4().hex
            result_file = f"/tmp/hydra_tsp_result_{token}.txt"
            # Wrap the command so the job writes its exit code on completion.
            # Passing bash -lc as separate argv entries avoids nested shell-quoting issues.
            script = f"{cmd_core}; echo $? > {shlex.quote(result_file)}"
            # Enqueue in tsp; tsp prints the queued job id on stdout and returns immediately.
            p = subprocess.run(["tsp", "bash", "-lc", script], capture_output=True, text=True)
            stdout = (p.stdout or "").strip()
            # Try to parse a numeric job id from tsp output (best-effort).
            job_id = None
            if stdout:
                m = re.search(r"(\d+)", stdout)
                if m:
                    job_id = m.group(1)
            # Poll until the result file exists (job finished) OR until tsp -l no longer lists the job id
            while True:
                if os.path.exists(result_file):
                    break
                # fallback: if we have a job_id, check tsp -l
                if job_id is not None:
                    list_p = subprocess.run(["tsp", "-l"], capture_output=True, text=True)
                    if job_id not in (list_p.stdout or ""):
                        # not in queue -> either finished or removed
                        # wait briefly for the result file to appear
                        time.sleep(0.2)
                        if os.path.exists(result_file):
                            break
                        else:
                            # still no result file; treat as unknown
                            break
                time.sleep(self.poll_interval)
            # Read exit code (best-effort)
            exit_code = None
            try:
                with open(result_file, "r") as f:
                    exit_code = int(f.read().strip())
            except Exception:
                exit_code = None
            # cleanup result file
            try:
                os.remove(result_file)
            except Exception:
                pass
            # Wrap the outcome in a JobReturn so Hydra's default sweeper can consume it.
            # We mark every job COMPLETED and carry the exit code (or the tsp job id if no
            # exit code was recovered); a stricter launcher could mark non-zero exits as
            # JobStatus.FAILED and store an exception as the return value instead.
            ret = JobReturn()
            ret.overrides = list(overrides)
            ret.status = JobStatus.COMPLETED
            ret.return_value = exit_code if exit_code is not None else job_id
            results.append(ret)
        return results
How to use (quick)
- Add an env var so the launcher knows how to re-run your Hydra app:
- export HYDRA_TSP_APP_MODULE=my_project.train
- Or modify _compose_cmd() to fit your project.
- Package & install the plugin (see packaging below) or add it to PYTHONPATH.
- Run a multirun selecting the launcher:
- python train.py --multirun lr=1e-3,1e-4 dataset=mnist,cifar hydra/launcher=task_spooler
This minimal flow will enqueue each run, wait for completion (via result-file polling), and return one JobReturn per run (carrying the exit code) to Hydra. After you validate this, you can evolve the launcher to be fully asynchronous.
Packaging & registration (hydra_plugins namespace)
Hydra discovers plugins by importing everything under the hydra_plugins namespace package, so the important part of packaging is shipping that layout: keep hydra_plugins/ itself free of an __init__.py and put your code in hydra_plugins/hydra_tsp_launcher/ (tsp_launcher.py plus the config.py shown earlier). A minimal setup for setuptools (setup.cfg):
[metadata]
name = hydra_tsp_launcher
version = 0.0.1
...
[options]
packages = find_namespace:
[options.packages.find]
include = hydra_plugins.*
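If you use setup.py instead of setup.cfg, the equivalent sketch looks like this (the distribution name and Hydra version pin are assumptions):
# setup.py — package the plugin so the hydra_plugins namespace ships with it
from setuptools import find_namespace_packages, setup

setup(
    name="hydra_tsp_launcher",
    version="0.0.1",
    packages=find_namespace_packages(include=["hydra_plugins.*"]),
    install_requires=["hydra-core>=1.1"],
)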
Install in editable mode during development:
- pip install -e .
After installation, Hydra will import the plugin from the hydra_plugins namespace and pick up the task_spooler config registered in the ConfigStore. You can then select it with hydra/launcher=task_spooler on the CLI or in your Hydra config.
For more plugin patterns and config examples, see Hydra’s plugin configuration docs: https://hydra.cc/docs/patterns/configuring_plugins/.
Debugging, testing and tips for PyTorch Lightning workflows
- Start small: write a trivial script hello.py that prints and exits; test the launcher with 2-3 quick jobs before hooking into your heavy PL training runs (a sketch follows after this list).
- Use the result-file approach: it’s the simplest robust contract between parent and queued child. Your tsp fork can track GPUs, but have the child job itself write the final status to a file your launcher reads.
- Make the job command deterministic: pass the multirun overrides exactly as the strings Hydra gives you. If your job depends on the working directory, set cwd explicitly in the wrapped command.
- If you’re using GPU tracking in your tsp fork, you can optionally pass a --gpu-req parameter or environment variable into the wrapper command to let tsp schedule appropriately.
- Logging and artifacts: have the child write logs to the run directory (the Hydra run dir or a path your launcher passes in). That makes artifact collection trivial.
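A throwaway hello.py for smoke-testing the launcher might look like the sketch below (it defines no config file, so sweep overrides need the + prefix to append new keys):
# hello.py — trivial Hydra app used only to exercise the launcher
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path=None, config_name=None)
def main(cfg: DictConfig) -> None:
    # Print whatever overrides were applied, then exit 0.
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
Then try, for example: python hello.py --multirun +x=1,2 hydra/launcher=task_spooler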
Advanced: asynchronous patterns and scaling
Want non-blocking behavior? Two common routes:
- Return job IDs and provide a separate monitor/collector: launch() returns tsp job IDs immediately. A background monitor process (or a short-lived hydra-tsp-monitor CLI) polls tsp -l and the job output files, and writes final results to a shared store (Redis, a file, a database). Hydra can be extended (or a custom UI built) to show final statuses by reading that store. A compact sketch of this route follows after this list.
- Return future-like objects (if your runtime has them)
- If you can return an object Hydra and your environment understand (like a Ray ObjectRef used in the Ray plugin), you can rely on that runtime’s API to get results. See the Ray plugin for an example: https://hydra.cc/docs/plugins/ray_launcher/.
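A compact sketch of the first route — enqueue, return IDs, poll later. It assumes your tsp build supports tsp -s <id> to print a job's state (check your fork; flag support varies):
# Async pattern: enqueue every job, return tsp job ids right away,
# and let a separate monitor query states / collect results later.
import re
import subprocess
from typing import List, Optional, Sequence


def enqueue_all(shell_commands: Sequence[str]) -> List[Optional[str]]:
    """Enqueue each shell command in tsp and return the job ids it prints."""
    job_ids: List[Optional[str]] = []
    for cmd in shell_commands:
        p = subprocess.run(["tsp", "bash", "-lc", cmd], capture_output=True, text=True)
        m = re.search(r"(\d+)", p.stdout or "")
        job_ids.append(m.group(1) if m else None)
    return job_ids


def job_state(job_id: str) -> str:
    """Ask tsp for a job's state (queued / running / finished), assuming `tsp -s` exists."""
    p = subprocess.run(["tsp", "-s", job_id], capture_output=True, text=True)
    return (p.stdout or "").strip()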
Community pointer: there’s an open discussion about adding a Task-Spooler launcher in the Hydra project (useful for inspiration): https://github.com/facebookresearch/hydra/issues/715.
Sources
- Plugin development overview: https://hydra.cc/docs/advanced/plugins/develop/
- Plugins overview (launchers): https://hydra.cc/docs/advanced/plugins/overview/
- Submitit Launcher plugin (example of enqueue + id): https://hydra.cc/docs/plugins/submitit_launcher/
- Ray Launcher plugin (example of future-like refs): https://hydra.cc/docs/plugins/ray_launcher/
- Joblib Launcher (parallel/local launcher example): https://hydra.cc/docs/plugins/joblib_launcher/
- Configuring plugins (entry points / config): https://hydra.cc/docs/patterns/configuring_plugins/
- Multirun tutorial (context for sweep & launchers): https://hydra.cc/docs/tutorials/basic/running_your_app/multi-run/
- Hydra GitHub repository (source examples and other plugins): https://github.com/facebookresearch/hydra
- Task-spooler plugin request / community discussion: https://github.com/facebookresearch/hydra/issues/715
Conclusion
You can get a working Hydra launcher for task-spooler quickly by subclassing hydra.plugins.launcher.Launcher, shipping the class inside the hydra_plugins namespace package with a ConfigStore entry under the hydra/launcher group, and using the simple pattern shown above: enqueue a job with tsp, have the job write a small result file, poll until completion, then return the result (exit code or job id, wrapped in a JobReturn) to Hydra. Start with the blocking/result-file approach to validate correctness; once that works, iterate toward async handles/monitors for scale. The Hydra docs and the submitit/ray plugins are great references while you build your custom Hydra launcher.