Grant
Theme fund: NGI0 Discovery
Start: 2020-08
End: 2022-09

Re-isearch

Vectorise text with a flexible unit of retrieval

Project re-isearch: a novel multimodal search and retrieval engine that uses mathematical models and algorithms different from the all-too-common inverted index (popularized by Salton in the 1960s). Its design places no limits on word frequency, term length, number of fields, or complexity of structured data, and it even supports overlap, where fields or structures cross each other's boundaries (common examples are quotes, lines/sentences, biblical verses, and annotations). Its model enables a completely flexible unit of retrieval and flexible modes of search.

Initial project outcome: a freely available, completely open-source, multiplatform C++ library, bindings for other languages (such as Python), and reference sample code that uses the library in some of these languages.

Why does this actually matter to end users?

“Re-isearch” follows in the spirit of the original isearch, developed in the 1990s. Like the original, it is not only about textual words: the design covers a large number of object types, including numerical, range, and geospatial data. It is unique among full-text systems in that it provides numerous object types with their own search methods and allows them to be viewed in parallel as text. A date field, for instance, can be searched as a date but also as text, by searching for the words in the field. Re-isearch will also be one of the first engines to support some key parts of the new ISO 8601:2019 date semantics.
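
To make the "searchable as a date and as text" idea concrete, here is a minimal Python sketch. It does not use the re-isearch library or its bindings; the class DualViewField and its methods are hypothetical names invented for illustration, and the date detection is limited to a plain YYYY-MM-DD pattern rather than full ISO 8601:2019 semantics.

```python
import re
from datetime import date

# Assumption: only plain YYYY-MM-DD dates are recognised in this sketch.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


class DualViewField:
    """Hypothetical field that keeps a typed view and a text view of one value."""

    def __init__(self, raw: str):
        self.raw = raw
        match = ISO_DATE.search(raw)
        # Typed view: a datetime.date, or None if no date was found.
        self.as_date = (
            date(int(match.group(1)), int(match.group(2)), int(match.group(3)))
            if match
            else None
        )
        # Text view: the lower-cased words of the same field.
        self.words = set(re.findall(r"\w+", raw.lower()))

    def in_range(self, start: date, end: date) -> bool:
        """Search the field as a date (inclusive range)."""
        return self.as_date is not None and start <= self.as_date <= end

    def has_word(self, word: str) -> bool:
        """Search the same field as text."""
        return word.lower() in self.words


field = DualViewField("released 2020-08-15, summer edition")
print(field.in_range(date(2020, 1, 1), date(2020, 12, 31)))  # date search
print(field.has_word("summer"))                               # text search
```

The same stored value answers both kinds of query, which is the point of the parallel view; a real engine would index both views rather than scan them.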

These objects do not even have to be part of a document; they may be made available through interface glue to other systems via ODBC, CORBA, or object embedding. This allows indexed content (for example, from RSS/XML) to be stored in and searched from other systems, which is useful in many dynamic applications in commerce and trading (keeping live counts of goods on hand, selling prices, etc.). Objects do not always have to be defined explicitly either: when detection is enabled, the various doctypes (document handlers) can automatically detect a number of field data types at index time, for example that a value is a telephone number or a date.
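
As an illustration of index-time type detection, the following Python sketch classifies a field value with a few simple heuristics. It is not taken from the re-isearch doctypes; the function name detect_field_type, the type labels, and the regular expressions are all assumptions chosen for the example.

```python
import re
from datetime import date


def detect_field_type(value: str) -> str:
    """Hypothetical heuristic: guess a field's data type from its content."""
    v = value.strip()

    # Date: YYYY-MM-DD that is also a valid calendar date.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
        try:
            date.fromisoformat(v)
            return "date"
        except ValueError:
            pass  # looked like a date but is not one; keep checking

    # Numeric: optional sign, digits, optional decimal part.
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", v):
        return "numeric"

    # Phone: optional '+', then digits mixed with spaces, brackets or hyphens.
    if re.fullmatch(r"\+?\d[\d\s()-]{6,}\d", v):
        return "phone"

    # Fallback: treat the value as plain text.
    return "text"


for sample in ["2020-08-15", "42.5", "+31 20 888 4252", "goods on hand"]:
    print(sample, "->", detect_field_type(sample))
```

The order of the checks matters in this sketch: a bare run of digits is reported as numeric before the phone pattern is tried, so a real handler would need context or configuration to resolve such ambiguity.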

A radical departure from other designs is its concept of search granularity. Typical text indexers work with a document or record, which is both the unit of indexing and the unit of retrieval. Re-isearch instead supports a dynamic unit of retrieval chosen at search time, either specified by the user or determined heuristically. The structure of documents can be exploited to identify which document elements (such as the appropriate chapter or page) to retrieve. Retrieval granularity may be at the level of substructures of a given document or page, such as a line or paragraph, but may also be at the level of a larger collection.
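
The idea of choosing the unit of retrieval at search time can be sketched in a few lines of Python. This is an illustrative toy and not the re-isearch algorithm: the Span type, the retrieve function, and the sample data are hypothetical. Note that the quote span deliberately crosses the boundary between the two lines, as an example of overlapping structure.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical structural element covering tokens[start:end]."""
    kind: str
    start: int
    end: int  # exclusive


def retrieve(tokens: List[str], spans: List[Span], term: str, unit: str) -> List[str]:
    """Return the text of every span of kind `unit` that contains `term`."""
    hits = [i for i, tok in enumerate(tokens) if tok.lower() == term.lower()]
    return [
        " ".join(tokens[s.start:s.end])
        for s in spans
        if s.kind == unit and any(s.start <= h < s.end for h in hits)
    ]


tokens = "the quick fox said stop and then the dog ran".split()
spans = [
    Span("document", 0, 10),
    Span("line", 0, 5),
    Span("line", 5, 10),
    Span("quote", 4, 7),  # overlaps both lines
]

# The same index answers the same query at different granularities.
print(retrieve(tokens, spans, "then", "line"))      # ['and then the dog ran']
print(retrieve(tokens, spans, "then", "quote"))     # ['stop and then']
print(retrieve(tokens, spans, "then", "document"))  # the whole text
```

The text is tokenised once and the unit is only a query parameter, so nothing has to be re-indexed to switch from lines to quotes or whole documents.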

This project was funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322.