Theme fund: NGI0 Discovery
Start: 2020-04
End: 2022-10

In-document search

Interoperable Rich Text Changes for Search

There is a relatively unexplored layer of metadata inside the document formats we use, such as Office documents. This metadata makes it possible to answer queries like: show me all the reports with edits made within a given timespan, by a certain user or by a group of users. Or: show me all the hyperlinks inside documents that point to a web resource that is about to be moved. Or: list all presentations that contain this copyrighted image. Such embedded information could be exposed to and used by search engines far better than is currently the case. The project expands the ODF Toolkit library to dissect file formats, and may have the very useful side effect of maturing the understanding of document metadata at large, and of collaborative document editing in particular.
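As a rough illustration of the first query, the sketch below reads the tracked-changes metadata that an OpenDocument text file stores in its content.xml and filters it by author and timespan. It uses only the Python standard library and is not the ODF Toolkit's API (that library is written in Java); the function names tracked_changes and changes_by are hypothetical. It also assumes a simplification: timestamps are cut to whole seconds and time zones are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET
from datetime import datetime

# Standard ODF namespaces used for change tracking.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def tracked_changes(odt_file):
    """Yield (kind, creator, date) for each tracked change in an .odt file.

    Hypothetical helper: odt_file is a path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    for region in root.iter("{%s}changed-region" % NS["text"]):
        # Children are text:insertion, text:deletion or text:format-change.
        for change in region:
            info = change.find("office:change-info", NS)
            if info is None:
                continue
            kind = change.tag.split("}", 1)[1]
            creator = info.findtext("dc:creator", default="", namespaces=NS)
            date_text = info.findtext("dc:date", default="", namespaces=NS)
            try:
                # Assumption: keep only YYYY-MM-DDTHH:MM:SS, drop the rest.
                when = datetime.fromisoformat(date_text[:19])
            except ValueError:
                continue  # Skip changes without a usable timestamp.
            yield kind, creator, when


def changes_by(odt_file, users, start, end):
    """Return the tracked changes made by any of users between start and end."""
    return [
        change
        for change in tracked_changes(odt_file)
        if change[1] in users and start <= change[2] <= end
    ]
```

Calling changes_by("report.odt", {"Alice", "Bob"}, datetime(2021, 1, 1), datetime(2021, 12, 31)) over a folder of files would give a small version of the "edits within a timespan by a group of users" query; the file name and user names here are placeholders.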

Why does this actually matter to end users?

Searching usually starts with a vague memory: a name or number in the back of your mind, or some little detail that stuck with you. Unfortunately, that memory does not tell you where or how to start your search. How humans search and how computers handle a query do not always match, which can be frustrating: you end up shouting at your screen.

One of the technologies that can make search and discovery more intuitive is semantic search, which is poetically explained as 'search with meaning'. Essentially, instead of searching for a literal number or letter, semantic search tools try to understand the context, location, intent, word variations, and other important points you imply when typing in a query. You do not know what file you are searching for, but you know it has something to do with your upcoming tax report. Or you cannot for the life of you remember that person's name, but you know who their colleagues are. Semantic search takes your vague plan and scraps of information, works out how everything is connected, and gives you the information you were looking for.

Intuitive semantic search requires rich data structures, for example the information hidden in the documents we make. This means not only the metadata stating that you saved a text file at a certain time in a particular folder, but also more detailed information and connections, such as which images in your presentation are copyrighted or which links in your reports no longer work. This project aims to develop these search capabilities for OpenDocument files, a widely used open document standard. Especially when handling a lot of documents and data, such detailed search can be an incredible time saver, or even provide unique new insights, for example in data-centered research or journalism.
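To make the hyperlink case concrete, here is a minimal sketch that lists the links in an OpenDocument text file whose target starts with a given prefix, for instance a site that is about to move. It uses only the Python standard library rather than the ODF Toolkit, looks only at text:a elements in content.xml, and the function name links_to is hypothetical. Checking whether a link still works would need an extra network request per link, which is left out here.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard namespaces for ODF text content and XLink attributes.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def links_to(odt_file, prefix):
    """Return (href, link text) for every hyperlink whose target starts with prefix.

    Hypothetical helper: odt_file is a path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    found = []
    for anchor in root.iter("{%s}a" % TEXT_NS):
        href = anchor.get("{%s}href" % XLINK_NS, "")
        if href.startswith(prefix):
            # Join nested spans so the visible link text is complete.
            found.append((href, "".join(anchor.itertext())))
    return found
```

Running links_to over every file in a document archive, with the prefix of the resource that is moving, would produce the list of documents and links that need updating.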

Run by an independent freelancer


This project was funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322.