Resources¶
It is nearly always preferred to run steps in parallel when possible, so StepUp will launch any queued step as soon as an execution slot becomes available. A “resource” is a named quantity that limits how many steps can run concurrently in cases where full parallelization would be counterproductive:
-
Some programs behave poorly (have bugs) when multiple instances are running in parallel. Here are a few examples encountered (and now also resolved) in the development of StepUp RepRep:
-
Some steps may consume a lot of resources, such as memory or GPU compute, requiring more resources than available when running in parallel.
-
Some software licenses may limit the number of instances running in parallel.
Available resources are declared via the STEPUP_RESOURCES environment variable
or the --resources CLI argument (CLI values take precedence).
Both accept a comma-separated list of name:quantity pairs,
e.g. STEPUP_RESOURCES="cpu:4,gpu:1,memgb:16".
Steps declare which resources they need with the resources keyword argument,
which accepts either a dict (e.g. {"gpu": 1}) or a shorthand string (e.g. "gpu:1" or "gpu").
Resources not listed in STEPUP_RESOURCES or --resources are treated as unavailable,
so any step requiring them will never run.
Requesting non-positive quantity of a resource (e.g. resources={"gpu": 0})
is not allowed and will raise an error,
but one can specify a resource with zero quantity in STEPUP_RESOURCES
to make it unavailable (e.g. STEPUP_RESOURCES="gpu:0").
More details can be found in the step() API documentation.
Example¶
Example source files: docs/advanced_topics/resources/
The example illustrates three steps with different resource requirements.
Create the following plan.py:
#!/usr/bin/env python3
from stepup.core.api import run
run("sleep 0.1; echo A", shell=True, resources="cpu")
run("sleep 0.1; echo B", shell=True, resources="gpu")
run("sleep 0.1; echo C", shell=True, resources={"cpu": 2, "gpu": 1})
The sleep command ensures each step lasts long enough to guarantee
they will run in parallel when allowed.
Step C requires 2 CPUs and 1 GPU simultaneously, so it can only start once both A (which holds 1 CPU) and B (which holds 1 GPU) have finished.
Set the environment variable and run StepUp with four parallel jobs:
You should get the following output:
DIRECTOR │ Listening on /tmp/stepup-########/director (StepUp Core 3.2.3.post54)
DIRECTOR │ Setting available resources: cpu:2,gpu:1
STARTUP │ (Re)initialized boot script
PHASE │ build
START │ ./plan.py
SUCCESS │ ./plan.py
START │ sleep 0.1; echo A
START │ sleep 0.1; echo B
SUCCESS │ sleep 0.1; echo A
─────────────────────────────── Standard output ────────────────────────────────
A
────────────────────────────────────────────────────────────────────────────────
SUCCESS │ sleep 0.1; echo B
─────────────────────────────── Standard output ────────────────────────────────
B
────────────────────────────────────────────────────────────────────────────────
START │ sleep 0.1; echo C
SUCCESS │ sleep 0.1; echo C
─────────────────────────────── Standard output ────────────────────────────────
C
────────────────────────────────────────────────────────────────────────────────
DIRECTOR │ Trying to delete 0 outdated output(s)
DIRECTOR │ See you!
Steps A and B start immediately in parallel. Despite allowing 4 steps to run in parallel, step C only starts after both A and B have finished, because it needs 2 CPUs and 1 GPU at the same time.
Try the Following¶
-
Run
sb -j 4again without making changes. Skipping steps requires hash computations, which are done by a dedicated hashing subprocess and are never subject to resource restrictions. -
Change
STEPUP_RESOURCESto"cpu:4,gpu:2"and verify that all three steps can now run in parallel. When you try this, StepUp will continue skipping steps. To forcibly re-execute steps, remove the file.stepup/graph.dband restart StepUp. -
Remove a resource from
STEPUP_RESOURCES(e.g. set it to"cpu:2") and observe that steps requiring the missing resource are never started and remain pending.