Theme fund: NGI Zero Core
Start: 2024-04
Software engineering

Tracing and rebuilding packages

Improved metadata/provenance for build artifacts

For many end users the smallest unit of software is the "package": a collection of programs and configuration files, typically shipped as a single archive. Examples are "util-linux", "glibc", "bash" and "ffmpeg".

Open source distributions install packages using their package management systems. The package management system writes the contents of a package to disk when the package is installed or updated, and removes them when the package is removed. The packages themselves contain metadata maintained by the distribution maintainers, such as the name of the package, project URL, description, dependency information and license information.

This granularity can be too coarse. For example, license information is aggregated at the package level. If individual files in a package are under different licenses, the package-level license information will not always make this clear.

This project will make this easier to understand by looking at what goes into each individual binary in a package, and by assigning metadata to the individual binaries instead of to the package as a whole. It will do so by tracing the build of a package and recording which files are actually used. By building packages in a minimal (container) environment, capturing the build trace, and processing that trace to see exactly what goes into which binary, it becomes much easier to zoom in and answer specific questions such as "what license does this binary have" or "which binaries use vulnerable file X". The results can then be combined with efforts like VulnerableCode and PurlDB.
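As a rough illustration of the approach described above, the following Python sketch processes a heavily simplified build trace and works out which source files end up in which binary. It is not the project's implementation. The trace format (one record per build step, with the files it read and wrote), the file names, the license data and all function names are hypothetical and chosen only for this example. A real trace would first have to be captured and parsed from the actual build.

```python
from collections import defaultdict

# Hypothetical, simplified trace: one record per build step (process),
# listing the files that step read and the files it wrote.
TRACE = [
    {"cmd": "cc -c main.c", "reads": ["main.c", "util.h"], "writes": ["main.o"]},
    {"cmd": "cc -c util.c", "reads": ["util.c", "util.h"], "writes": ["util.o"]},
    {"cmd": "cc -c extra.c", "reads": ["extra.c"], "writes": ["extra.o"]},
    {"cmd": "cc -o tool", "reads": ["main.o", "util.o"], "writes": ["tool"]},
    {"cmd": "cc -o helper", "reads": ["extra.o"], "writes": ["helper"]},
]

# Hypothetical per-file license data (assumed to come from a separate scan).
LICENSES = {
    "main.c": "GPL-2.0-only",
    "util.c": "MIT",
    "util.h": "MIT",
    "extra.c": "BSD-3-Clause",
}


def direct_inputs(trace):
    """Map every file written during the build to the files read by the step that wrote it."""
    inputs = defaultdict(set)
    for step in trace:
        for out in step["writes"]:
            inputs[out].update(step["reads"])
    return inputs


def source_inputs(target, inputs):
    """Follow intermediate files (such as .o files) back to files the build did not produce."""
    seen, stack, sources = set(), [target], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in inputs:
            # Produced by the build: keep walking through its inputs.
            stack.extend(inputs[current])
        else:
            # Never written by any step: treat it as an original source file.
            sources.add(current)
    return sources


def binaries_using(path, binaries, inputs):
    """Answer "which binaries use file X"."""
    return sorted(b for b in binaries if path in source_inputs(b, inputs))


def licenses_of(binary, inputs, licenses):
    """Answer "what license does this binary have" as a set of per-file licenses."""
    return sorted({licenses.get(f, "unknown") for f in source_inputs(binary, inputs)})


if __name__ == "__main__":
    inputs = direct_inputs(TRACE)
    for binary in ("tool", "helper"):
        print(binary, sorted(source_inputs(binary, inputs)), licenses_of(binary, inputs, LICENSES))
    print(binaries_using("util.h", ["tool", "helper"], inputs))
```

In this toy trace the package as a whole would carry three licenses, while the per-binary view shows that "helper" only contains code from "extra.c". The sketch ignores many things a real trace contains, such as files that are opened but never read, renames, and generated sources.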

