Static Deferred Glob

When a dataset comprises tens of thousands of files, it makes little sense to declare all of them static if only a handful will actually be used. StepUp addresses this with the deferred glob feature, which makes previously unknown files static when:

  1. they are used as inputs of a new step and
  2. they match a deferred glob pattern.
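
The two conditions can be read as a simple predicate. The following is a minimal sketch of that rule in plain Python, not StepUp's actual implementation: the function name `becomes_static` and the use of `fnmatch` for pattern matching are assumptions made for illustration only.

```python
from fnmatch import fnmatch


def becomes_static(path, step_inputs, deferred_patterns):
    """Return True if a previously unknown file would be made static.

    Hypothetical model of the rule above, not StepUp's own code:
    the file must be (1) an input of a new step and
    (2) matched by at least one deferred glob pattern.
    """
    used_as_input = path in step_inputs
    matches_glob = any(fnmatch(path, pattern) for pattern in deferred_patterns)
    return used_as_input and matches_glob


# foo.txt is an input of a step and matches "*.txt".
print(becomes_static("foo.txt", {"foo.txt"}, ["*.txt"]))
# bar.txt matches "*.txt", but no step uses it as an input.
print(becomes_static("bar.txt", {"foo.txt"}, ["*.txt"]))
```

In this model, a file that matches the pattern but is never requested stays unknown, which mirrors how bar.txt is treated in the example below.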

Example

Example source files: docs/advanced_topics/static_deferred_glob/

Create two text files with some content, foo.txt and bar.txt, and the following plan.py:

#!/usr/bin/env python3
from stepup.core.api import glob, graph, runsh

glob("*.txt", _defer=True)
runsh("cat foo.txt", inp="foo.txt")
graph("graph")

Run the plan interactively with StepUp:

chmod +x plan.py
stepup boot -n 1

You should get the following screen output:

  DIRECTOR │ Listening on /tmp/stepup-########/director (StepUp 3.0.0)
   STARTUP │ (Re)initialized boot script
  DIRECTOR │ Launched worker 0
     PHASE │ run
     START │ runpy ./plan.py
   SUCCESS │ runpy ./plan.py
     START │ runsh cat foo.txt
   SUCCESS │ runsh cat foo.txt
─────────────────────────────── Standard output ────────────────────────────────
This is foo.
────────────────────────────────────────────────────────────────────────────────
  DIRECTOR │ Trying to delete 0 outdated output(s)
  DIRECTOR │ Stopping workers
  DIRECTOR │ See you!

As expected, foo.txt is used as a static file. Of course, this would also have been the case without the _defer=True option. The key difference is that with _defer=True, StepUp does not create a list of all matching *.txt files. This can be seen when examining the file graph.txt, which has no trace of bar.txt:

root:
             creates   file:./
             creates   file:plan.py
             creates   step:runpy ./plan.py

file:./
               state = STATIC
          created by   root:
            supplies   file:foo.txt
            supplies   file:plan.py
            supplies   step:runpy ./plan.py
            supplies   step:runsh cat foo.txt

file:plan.py
               state = STATIC
              digest = 93a71952 c0b7ae7f 75e3cc01 2c78c1dd 34c1581b 6b6f6988 7ecc1691 e8a11805
                     = c06cfff6 cac658dd cb03a194 6d8aefe9 8c018060 90ddad78 522a3f82 310a4dcf
          created by   root:
            consumes   file:./
            supplies   step:runpy ./plan.py

step:runpy ./plan.py
               state = RUNNING
          created by   root:
            consumes   file:./
            consumes   file:plan.py
             creates   dg:*.txt
             creates   step:runsh cat foo.txt

dg:*.txt
          created by   step:runpy ./plan.py
             creates   file:foo.txt

step:runsh cat foo.txt
               state = QUEUED
          created by   step:runpy ./plan.py
            consumes   file:./
            consumes   file:foo.txt

file:foo.txt
               state = STATIC
              digest = 0c64fa0d 9b93cfe0 46d049cd 30640438 385cec99 cf27db48 ad87ebb0 0f9d727d
                     = 646e46e6 ded92d12 458876d7 ba4f147d 6401a78e ffb2f12d 0595392c 89cf2784
          created by   dg:*.txt
            consumes   file:./
            supplies   step:runsh cat foo.txt

The node dg:*.txt in the graph (green octagon in the figures below) is the result of adding the _defer=True option. This node creates static files as they are needed by other steps. The deferred glob is ideal when many files could match the pattern but most of them are irrelevant for the build. For example, there could be thousands of .txt files in this scenario, and this would have no effect on the resources consumed by StepUp.
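
To make the resource argument concrete, here is a small, self-contained sketch that contrasts the two strategies on a temporary directory. It is a toy model under stated assumptions, not StepUp code: `eager_glob` and `deferred_glob` are hypothetical helpers, and the file names are made up for the demonstration.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def eager_glob(directory, pattern):
    """Register every matching file up front (work grows with the directory)."""
    return {path.name for path in Path(directory).glob(pattern)}


def deferred_glob(pattern, requested):
    """Register only the requested files that match (work grows with usage)."""
    return {name for name in requested if fnmatch(name, pattern)}


with tempfile.TemporaryDirectory() as tmp:
    # Create 1000 text files, of which the build uses only one.
    for i in range(1000):
        (Path(tmp) / f"data{i:04d}.txt").write_text(f"file {i}\n")
    eager = eager_glob(tmp, "*.txt")
    deferred = deferred_glob("*.txt", ["data0042.txt"])

# Number of files registered by each strategy.
print(len(eager), len(deferred))  # prints: 1000 1
```

The eager variant has to list the whole directory, while the deferred variant only inspects the paths that steps actually request.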

The dependency graph:

graph_dependency.svg

The provenance graph:

graph_provenance.svg

Try the Following

  • When using deferred globs, steps cannot create outputs that match the deferred glob. This would mean that a built file could be made static when used as input later, which is clearly inconsistent. Try causing this error by adding a step copy("foo.txt", "foo2.txt").

  • Remove the _defer=True option and inspect the corresponding graph.txt. You should see that bar.txt is now indeed included in the graph.
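
The restriction on outputs in the first item can be pictured as a check performed when a step declares its outputs. The sketch below is a hypothetical illustration in plain Python: the helper `check_outputs` and the use of `ValueError` are assumptions, and the message StepUp actually reports will differ.

```python
from fnmatch import fnmatch


def check_outputs(outputs, deferred_patterns):
    """Reject outputs that match a deferred glob (hypothetical check).

    If such an output were allowed, a later step using it as input
    could turn a built file into a static one.
    """
    for out in outputs:
        for pattern in deferred_patterns:
            if fnmatch(out, pattern):
                raise ValueError(
                    f"output {out!r} matches deferred glob {pattern!r}"
                )


# An output that does not match the deferred glob is accepted.
check_outputs(["foo2.dat"], ["*.txt"])

# The output of copy("foo.txt", "foo2.txt") matches "*.txt" and is rejected.
try:
    check_outputs(["foo2.txt"], ["*.txt"])
except ValueError as exc:
    print(exc)
```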