Inventory Files¶
StepUp RepRep uses inventory.txt to collect metadata to prepare a ZIP archive.
Inventory files can either be created manually (for an external dataset)
or automatically (for datasets created within a StepUp workflow.)
File Formats¶
Inventory Files (inventory.txt)¶
The inventory.txt file contains one line per file to be archived.
Each line has four fields:
- the file size,
- the file mode,
- the BLAKE2b hash of the file contents (regular file) or link destination (symbolic link), and
- the relative path to the file, starting from the parent of the inventory file.
A fixed column width is used. You do not create this type of file manually. Instead, StepUp RepRep offers tools to make such files. (See below.)
Inventory Definitions (inventory.def)¶
The inventory.def format is inspired by the MANIFEST.in format
from the setuptools project.
As of version 1.2.0, StepUp RepRep no longer relies on setuptools
and has its own implementation to process inventory definitions.
Inventory definition files are processed one line at a time.
Comments start with the # character, just as in Python source code.
Empty lines or lines containing only comments are ignored.
A Non-empty line must contain one command starting with include or exclude.
Each command first generates a new list of paths.
An include command will add the newly generated paths to the inventory,
whereas an exclude command will remove them from the inventory (if present).
Each command operates on the file list constructed by preceding commands.
(The first command starts from an empty list.)
The following rules are supported:
- A bare
includeorexcludeis followed by one or more Named Glob patterns, which may also contain**wildcards to match files recursively. - The
include-gitandexclude-gitusegit ls-filesto generate a list of files. Arguments to this command are optional and are passed togit ls-files. - The
include-workflowandexclude-workflowextract a file list from one or more StepUpgraph.dbfiles. The first argument is the state of the files to be selected. (STATICorBUILTare common. Other states exist but make less sense.) All subsequent arguments are (Named Glob patterns matching) StepUpgraph.dbfiles.
If any of these rules produces directories, they will not be included in the inventory. An exception is raised when no matching paths are found.
Creating inventory.txt Files¶
Command-line Tool stepup make-inventory¶
One can create an inventory file on the command-line as follows:
See stepup make-inventory --help for more details.
This tool is suitable for creating inventory files of external datasets.
StepUp RepRep Function make_inventory¶
One may include a make_inventory() step in plan.py as follows:
from stepup.reprep.api import make_inventory
make_inventory("file1.txt", "file2.txt", "inventory.txt")
This step is mainly useful for creating an inventory file of a dataset generated by your StepUp workflow.
It can also be combined with stepup.core.glob for building datasets from static files,
or one may provide an inventory definition files as follows:
from stepup.reprep.api import make_inventory
make_inventory("inventory.txt", path_def="inventory.def")
Creating a ZIP Archive From a inventory.txt File¶
Command-line Tool stepup zip-inventory¶
Given an inventory.txt file, the corresponding ZIP file is created with:
This is a command-line wrapper around the zip_inventory function discussed below.
StepUp RepRep Function zip_inventory¶
The function zip_inventory() takes a inventory.txt file as input
and creates a ZIP file containing all the files listed in the inventory.
It differs from the conventional zip program in the following ways:
- File hashes are checked before adding files to the archive, which is mostly useful for external datasets.
- The
inventory.txtfile is included in the resulting ZIP file. - The ZIP file is reproducible: all time stamps in the ZIP file are set to January 1st, 1980.
- By default, symbolic links are added as links, instead of the contents they point to.
Unpacking the ZIP file¶
To unpack the ZIP file, use the following command:
This will correctly handle symbolic links, unlike Python’s built-in extractall method. See cpython#82102 for more details on Python’s (lacking) support for symbolic links in ZIP files.
Validate an Inventory File¶
If you just want to check the file sizes, modes and hashes of an inventory, run:
This can be useful in the following cases:
- When you work with a remote dataset, you can check if the files in the inventory have changed.
- When you unpack a ZIP file created
stepup zip-inventoryorzip_inventory(), you can check if the files are not affected by bit rot or other data integrity issues.