Theme fund: NGI0 Discovery
Start: 2019-08
End: 2022-10

Search and Displace

Find and redact privacy sensitive information

The goal of this project is to establish a workflow and toolchain that address the problem of mass search and displacement of document content, where the original documents exist in a range of forms: a wide variety of digital document formats, both binary and more modern compressed XML forms, and potentially even older documents whose only surviving form is printed or handwritten. The term "displacement" is meant to encompass actions taken on the discovered content that go beyond straight replacement, including content tagging and redaction, as well as more complex contextual and user-refined replacement on an iterative basis. It is assumed that this process will run as a server application, with documents uploaded as needed, either individually or in bulk. The solution would be built in a modular fashion so that future deployments could deploy and/or modify only the parts needed.

In practical terms this involves the creation of an open source toolchain that facilitates searching for private and confidential content inside documents, for instance attachments to email messages or documents that are to be published on a website. The tool can subsequently be used for the secure and automated redaction of sensitive documents. Building it as a modular solution enables it to be used “standalone” with a simple GUI, via the command line, or embedded within third-party systems such as document management systems, content management systems and machine learning systems. In addition, a modular approach will facilitate the use of the solution with different languages (natural and programming) and in different specialities, e.g. government archives, winning tenders, legal contracts and court documents.
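To make the idea of "search" followed by a pluggable "displacement" action more concrete, here is a minimal Python sketch. It is not the project's actual code: the pattern set, the Match class and the function names search, tag, redact and displace are all hypothetical, and it assumes the document has already been converted to plain text. Real detection of sensitive content would need far more than two regular expressions.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical, deliberately simple patterns chosen for illustration only.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


@dataclass
class Match:
    kind: str   # which pattern found it, e.g. "email"
    start: int  # offset of the first matched character
    end: int    # offset one past the last matched character
    text: str   # the matched content itself


def search(text: str) -> List[Match]:
    """Search step: collect every pattern hit, ordered by position."""
    found = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(Match(kind, m.start(), m.end(), m.group()))
    return sorted(found, key=lambda match: match.start)


def tag(match: Match) -> str:
    """Displacement action: replace the content with a label."""
    return f"[{match.kind.upper()}]"


def redact(match: Match) -> str:
    """Displacement action: cover the content, preserving its length."""
    return "X" * len(match.text)


def displace(text: str, matches: List[Match],
             action: Callable[[Match], str]) -> str:
    """Displace step: apply one action to each match, skipping overlaps."""
    pieces = []
    position = 0
    for match in matches:
        if match.start < position:
            continue  # overlaps a span that was already displaced
        pieces.append(text[position:match.start])
        pieces.append(action(match))
        position = match.end
    pieces.append(text[position:])
    return "".join(pieces)


if __name__ == "__main__":
    sample = "Contact jane@example.org or +44 20 7946 0958."
    hits = search(sample)
    print(displace(sample, hits, tag))
    print(displace(sample, hits, redact))
```

Keeping the search step separate from the action passed to displace mirrors the modular approach described above: a deployment could swap in a different detector or a different displacement action without touching the rest.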

Why does this actually matter to end users?

Everyone knows that once something is online, it can be hard if not impossible to take that information down again. This is especially risky when you need to share a document that also contains particularly sensitive or even confidential data. Considering the number of documents that businesses, organizations and individuals share online every day, mistakes are inevitable and potentially very harmful, possibly leading to (identity) theft, blackmail, or worse. Search and discovery in this sense is also a matter of privacy protection and granular control, the same way confidential details are sometimes redacted in government documents. This control should also be possible for documents that are already online, when 'the harm is already done' and you are desperately looking for a way to take a file down again or edit out any sensitive details.

This project can give users more control over precisely what information they want to share or publish online in their documents and what should be kept out of the public eye. A tool will be developed that can find out whether private or confidential information is leaked somewhere in a file and subsequently delete or cover up this data. The tool will cover documents in all shapes and sizes, ranging from digital forms and docs to printed files and even handwritten texts, and will be usable standalone or integrated into the document or content management systems that organizations already use. The project aims to make a modular toolkit so the technology is relevant for all sorts of users, for example people working with government archives, court documents and legal contracts. Instead of forgoing transparency and data accessibility for privacy and confidentiality, this technology upholds both values, each crucial to a functioning democracy.

Run by Moorcrofts

This project was funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322.