Static Deferred Glob
When dealing with massive datasets comprising tens of thousands of files, it doesn’t make sense to declare all of them static when only a handful will actually be used. StepUp addresses this with the deferred glob feature, which makes previously unknown files static when:
- they are used as inputs of a new step and
- they match a deferred glob pattern.
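The two conditions above can be illustrated with a small toy model. This is only a sketch of the rule, not StepUp's actual implementation: the class `DeferredGlobSketch` and its method `use_as_input` are hypothetical names, and pattern matching is approximated with the standard-library `fnmatch`.

```python
from fnmatch import fnmatch


class DeferredGlobSketch:
    """Toy model of a deferred glob; not StepUp's actual implementation."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.static_files = set()  # files made static so far

    def use_as_input(self, path):
        """Make `path` static if it matches the pattern; return whether it did."""
        if fnmatch(path, self.pattern):
            self.static_files.add(path)
            return True
        return False


dg = DeferredGlobSketch("*.txt")
dg.use_as_input("foo.txt")   # used as input and matches: becomes static
dg.use_as_input("data.csv")  # used as input but no match: not handled by this glob
print(sorted(dg.static_files))
```

In this model, a file that matches the pattern but is never passed to `use_as_input` is simply never recorded, which is the point of deferring the glob.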
Example
Example source files: docs/advanced_topics/static_deferred_glob/
Create two text files with some content, foo.txt and bar.txt, and also the following plan.py:
#!/usr/bin/env python3
from stepup.core.api import glob, step
from stepup.core.interact import graph
glob("*.txt", _defer=True)
step("cat foo.txt", inp="foo.txt")
graph("graph")
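If it is convenient, the two text files for this example can be created with a few lines of Python before running the plan. This is a minimal sketch: the content of foo.txt is taken from the screen output shown in this example, while the content of bar.txt is an assumption, since it is never printed.

```python
from pathlib import Path

# "This is foo." matches what `cat foo.txt` prints in the example output.
Path("foo.txt").write_text("This is foo.\n")
# The content of bar.txt is an assumption; any text will do.
Path("bar.txt").write_text("This is bar.\n")
```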
Run the plan interactively with StepUp:
You should get the following screen output:
DIRECTOR │ Listening on /tmp/stepup-########/director (StepUp 2.0.4)
STARTUP │ (Re)initialized boot script
DIRECTOR │ Launched worker 0
PHASE │ run
START │ ./plan.py
SUCCESS │ ./plan.py
START │ cat foo.txt
SUCCESS │ cat foo.txt
─────────────────────────────── Standard output ────────────────────────────────
This is foo.
────────────────────────────────────────────────────────────────────────────────
DIRECTOR │ Trying to delete 0 outdated output(s).
DIRECTOR │ Stopping workers.
DIRECTOR │ See you!
As expected, foo.txt is used as a static file.
Of course, this would also have been the case without the _defer=True option.
The key difference is that with _defer=True, StepUp does not create a list of all matching *.txt files.
This can be seen when examining the file graph.txt, which has no trace of bar.txt:
root:
creates file:./
creates file:plan.py
creates step:./plan.py
file:plan.py
state = STATIC
digest = e9a2826e 1262a2a9 8b01a7e8 cc5a1e13 b1fd2bb7 5233efc6 14ad6e42 5f4c9ff2
= d79dc863 ccd12e12 e7aa11a3 93d037fc a5cec13f 44f7e0ce 70381bfb 8eb624f6
created by root:
consumes file:./
supplies step:./plan.py
file:./
state = STATIC
created by root:
supplies file:foo.txt
supplies file:plan.py
supplies step:./plan.py
supplies step:cat foo.txt
step:./plan.py
state = RUNNING
created by root:
consumes file:./
consumes file:plan.py
creates dg:*.txt
creates step:cat foo.txt
dg:*.txt
created by step:./plan.py
creates file:foo.txt
file:foo.txt
state = STATIC
digest = 0c64fa0d 9b93cfe0 46d049cd 30640438 385cec99 cf27db48 ad87ebb0 0f9d727d
= 646e46e6 ded92d12 458876d7 ba4f147d 6401a78e ffb2f12d 0595392c 89cf2784
created by dg:*.txt
consumes file:./
supplies step:cat foo.txt
step:cat foo.txt
state = QUEUED
created by step:./plan.py
consumes file:./
consumes file:foo.txt
The node dg:*.txt in the graph (green octagon in the figures below) is the result of adding the _defer=True option.
This node will create static files as they are needed by other steps.
The deferred glob is ideal when a large number of files could match the pattern, most of which are irrelevant for the build.
For example, there could be thousands of .txt files in this scenario, but this would not have any effect on the resources consumed by StepUp.
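The difference in scale can be sketched with plain Python lists. This does not measure StepUp itself; the file names and the count of 5000 are made up for illustration, and `fnmatch` stands in for the pattern matching.

```python
from fnmatch import fnmatch

# Hypothetical directory listing with thousands of .txt files.
all_files = [f"file{i:04d}.txt" for i in range(5000)] + ["foo.txt"]
# Inputs actually requested by steps in the plan.
used_inputs = ["foo.txt"]

# Eager glob: every matching file would become a node in the graph.
eager_nodes = [p for p in all_files if fnmatch(p, "*.txt")]
# Deferred glob: only matching files that are used as inputs become nodes.
deferred_nodes = [p for p in used_inputs if fnmatch(p, "*.txt")]

print(len(eager_nodes), len(deferred_nodes))
```

In this toy setting, the eager list grows with the number of files on disk, whereas the deferred list grows only with the number of inputs that steps actually request.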
The dependency graph:
The provenance graph:
Try the Following
- When using deferred globs, steps cannot create outputs that match the deferred glob. Otherwise, a built file could be made static when used as input later, which is clearly inconsistent. Try causing this error by adding a step copy("foo.txt", "foo2.txt").
- Remove the _defer=True option and inspect the corresponding graph.txt. You should see that bar.txt is now indeed included in the graph.
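The restriction in the first exercise can be pictured as a simple consistency check. The sketch below is an illustration only: `check_output` is a hypothetical helper, the error type and message are not StepUp's, and `fnmatch` approximates the pattern matching.

```python
from fnmatch import fnmatch


def check_output(path, deferred_patterns):
    """Raise ValueError if an output path matches a deferred glob pattern."""
    for pattern in deferred_patterns:
        if fnmatch(path, pattern):
            raise ValueError(f"output {path!r} matches deferred glob {pattern!r}")


check_output("foo2.dat", ["*.txt"])  # fine: no deferred glob matches
try:
    # Same destination as copy("foo.txt", "foo2.txt") in the exercise.
    check_output("foo2.txt", ["*.txt"])
except ValueError as exc:
    print(exc)
```

Rejecting such outputs up front keeps the two roles separate: a path that matches a deferred glob is always a static input and never a built file.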