Grant
Theme fund: Binary Analysis Fund
Start: 2022-12

Automated clearing of source code files

More efficient retrieval of security and license compliance contextual information

A common task for companies is to clear software source code files for legal or security reasons before software developers can use them. The clearing process is tool driven, using tools such as code clone detectors/snippet matchers, license scanners and security scanners. Typically the clearing process starts from scratch for each new file that is analyzed. It ignores the fact that open source software mostly changes incrementally, so the software being scanned is likely to be nearly identical to previously seen software. For a (large) subset of files this characteristic can be used to (semi-)automate the process: when scanning a new file, first find the closest file in a set of known files, compute the difference with that known file, check where in the file the difference is, and use rules to determine what action to take depending on that location.
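The steps in the paragraph above (closest match, difference, location of the difference, rule) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the standard library `difflib` stands in for whatever similarity measure the project uses, the function names are hypothetical, and the `threshold` and `header_lines` values are arbitrary placeholders.

```python
import difflib


def closest_known_file(new_text, known_files):
    """Return (name, similarity) of the most similar known file.

    known_files maps a file name to its text. Comparing against every
    known file is only workable for a small set; it is a stand-in here.
    """
    best_name, best_ratio = None, 0.0
    for name, text in known_files.items():
        ratio = difflib.SequenceMatcher(None, text, new_text).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name, best_ratio


def header_changed(known_text, new_text, header_lines):
    """True if any difference touches the first header_lines lines
    of the known file (assumed to hold the license text)."""
    matcher = difflib.SequenceMatcher(
        None, known_text.splitlines(), new_text.splitlines(), autojunk=False
    )
    for tag, i1, _i2, _j1, _j2 in matcher.get_opcodes():
        # i1 is where the change starts in the known file (0-based).
        if tag != "equal" and i1 < header_lines:
            return True
    return False


def license_action(new_text, known_files, header_lines=1, threshold=0.8):
    """Hypothetical rule: decide what to do with a new file."""
    name, ratio = closest_known_file(new_text, known_files)
    if name is None or ratio < threshold:
        return "full scan"  # nothing close enough is known
    if header_changed(known_files[name], new_text, header_lines):
        return "rescan license"  # the license header was touched
    return "reuse license conclusion"


known = {"f.c": "/* MIT */\nint f(void) {\n    return 1;\n}\n"}
print(license_action("/* MIT */\nint f(void) {\n    return 2;\n}\n", known))
```

In this sketch a change in the function body leaves the earlier license conclusion in place, while a change in the first line triggers a new license scan.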

When scanning source code, people typically look at a file as a whole, as an individual unit, but never at the lifecycle of the file: how much was changed and where it was changed. For license compliance it makes no sense to rescan a file if the header containing the license text has not changed; earlier conclusions can simply be copied. For security it does not matter if only comments were changed but no code. This project tries to tackle this in three steps: finding the closest match to the code (is there already a known file that is close enough?), determining the structure of the file (what is comments, what is code), and comparing the two files to see where changes were made. Depending on the scenario (license compliance or security), the user can then take different actions.
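For the security scenario, the comment-versus-code distinction could be sketched as follows. This is a minimal illustration, not the project's method: it assumes C-style comments, uses a deliberately simple line-based classifier (hypothetical `classify_lines`) instead of a real parser, and ignores corner cases such as comment markers inside string literals. When a line mixes a comment and other text, it is labelled as code, so mistakes lean towards rescanning.

```python
import difflib


def classify_lines(text):
    """Label each line 'comment', 'blank' or 'code' (simplified, C-style)."""
    labels = []
    in_block = False
    for line in text.splitlines():
        stripped = line.strip()
        if in_block:
            end = stripped.find("*/")
            if end == -1:
                labels.append("comment")
            else:
                in_block = False
                # Anything after the closing marker is treated as code.
                labels.append("code" if stripped[end + 2:].strip() else "comment")
        elif not stripped:
            labels.append("blank")
        elif stripped.startswith("//"):
            labels.append("comment")
        elif stripped.startswith("/*"):
            end = stripped.find("*/", 2)
            if end == -1:
                in_block = True
                labels.append("comment")
            else:
                labels.append("code" if stripped[end + 2:].strip() else "comment")
        else:
            labels.append("code")
    return labels


def changed_kinds(known_text, new_text):
    """Return the set of line kinds touched by the difference."""
    old_lines, new_lines = known_text.splitlines(), new_text.splitlines()
    old_labels, new_labels = classify_lines(known_text), classify_lines(new_text)
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    kinds = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical text can still change kind, for example when an
            # opening comment marker was removed further up in the file.
            for k in range(i2 - i1):
                if old_labels[i1 + k] != new_labels[j1 + k]:
                    kinds.update((old_labels[i1 + k], new_labels[j1 + k]))
        else:
            kinds.update(old_labels[i1:i2])
            kinds.update(new_labels[j1:j2])
    return kinds


def security_action(known_text, new_text):
    """Hypothetical rule: only a change that touches code needs a rescan."""
    if "code" in changed_kinds(known_text, new_text):
        return "rescan"
    return "reuse earlier result"


old = "/* helper */\nint f(void) {\n    return 1;\n}\n"
print(security_action(old, "/* helper function */\nint f(void) {\n    return 1;\n}\n"))
```

A license compliance rule would look at the same change information but ask a different question, namely whether the part of the file that holds the license text was touched.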