Data and AI

This page contains a concise overview of projects funded by NLnet foundation that belong to Data and AI (see the thematic index). There is more information available on each of the projects listed on this page - all you need to do is click on the title or the link at the bottom of the section on each project to read more. If a description on this page is a bit technical and terse, don't despair — the dedicated page will have a more user-friendly description that should be intelligible for 'normal' people as well. If you cannot find a specific project you are looking for, please check the alphabetic index or just search for it (or search for a specific keyword).

AI-VPN — Local machine-learning-based analysis of VPN traffic

Our security decreases significantly when we are outside our offices. Current VPNs encrypt our traffic, but they do not protect our devices from attacks or detect if there is an infection. The AI-VPN project proposes a new solution that joins the VPN setup with a local AI-based IPS. The AI-VPN implements a state-of-the-art machine-learning-based Intrusion Prevention System in the VPN, generating alerts and blocking malicious connections automatically. The user is given a summary of the device's traffic, showing detected malicious patterns, leaked private data and security alerts, in order to protect and educate users about their security status and any risks they are exposed to.

>> Read more about AI-VPN

OCCRP Aleph disambiguation — OCCRP Aleph: disambiguating different people and companies

Aleph is an investigative data platform that searches and cross-references global databases with leaks and public sources to find evidence of corruption and trace criminal connections. The project will improve the way that Aleph connects data across different data sources and how it ranks recommendations and searches for reporters. Our goal is to establish a feedback loop where users train a machine learning system that will predict whether results showing a person or company refer to the same person or company. If successful, this means journalists can conduct more efficient research and investigations, finding key information more quickly and wasting less time trawling through irrelevant documents and datasets.
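
To make the feedback loop concrete: a pairwise disambiguation model of this kind can be pictured as a classifier over cheap similarity features between two records, trained on user-confirmed matches. The sketch below is a minimal illustration using scikit-learn; the feature set, record fields and training pairs are invented for the example and are not Aleph's actual pipeline.

```python
# Hypothetical sketch of pairwise entity matching: score whether two records
# refer to the same company, learning from user-labelled feedback pairs.
from difflib import SequenceMatcher

from sklearn.linear_model import LogisticRegression

def features(a: dict, b: dict) -> list:
    """Cheap similarity features between two entity records."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_country = float(a.get("country") == b.get("country"))
    same_type = float(a.get("type") == b.get("type"))
    return [name_sim, same_country, same_type]

# Feedback pairs labelled by users: 1 = same entity, 0 = different.
pairs = [
    ({"name": "ACME Holdings Ltd", "country": "gb", "type": "company"},
     {"name": "Acme Holdings Limited", "country": "gb", "type": "company"}, 1),
    ({"name": "ACME Holdings Ltd", "country": "gb", "type": "company"},
     {"name": "Acme Consulting GmbH", "country": "de", "type": "company"}, 0),
]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
model = LogisticRegression().fit(X, y)

# Probability that a new candidate pair refers to the same company.
candidate = features(
    {"name": "ACME Holdings", "country": "gb", "type": "company"},
    {"name": "Acme Holdings Ltd.", "country": "gb", "type": "company"},
)
print(model.predict_proba([candidate])[0][1])
```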

>> Read more about OCCRP Aleph disambiguation

Atomic Data — Typesafe handling of LinkedData

Atomic Data is a modular specification for sharing, modifying and modeling graph data. It uses links to connect pieces of data, and therefore makes it easier to connect datasets to each other - even when these datasets exist on separate machines. Atomic Data is especially suitable for knowledge graphs, distributed datasets, semantic data, p2p applications, decentralized apps and linked open data. It is designed to be highly extensible, easy to use, and to make the process of domain-specific standardization as simple as possible. It is type-safe linked data (a strict subset of RDF), which is also fully compatible with regular JSON. In this project, we'll work on the MIT-licensed atomic-server and atomic-data-browser, which are a graph database server and a modular web GUI that enable users to model, share and edit atomic data. We'll add functionality, improve stability and testing, improve documentation and create materials that help developers get started.
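
To illustrate the data model (as a rough sketch, not the atomic-server API): every property of a resource is itself a URL with a declared datatype, which is what makes the data both type-safe and plain-JSON compatible. The property URLs and datatype table below are illustrative.

```python
# Illustration only (not atomic-server's API): an Atomic Data resource is a
# set of (property URL -> typed value) pairs, so every key is dereferenceable
# and every value can be checked against the property's declared datatype.
resource = {
    "@id": "https://example.com/people/alice",
    "https://atomicdata.dev/properties/name": "Alice",          # string
    "https://atomicdata.dev/properties/createdAt": 1640995200,  # timestamp
}

# A minimal type check against (hypothetical) property definitions:
datatypes = {
    "https://atomicdata.dev/properties/name": str,
    "https://atomicdata.dev/properties/createdAt": int,
}
for prop, value in resource.items():
    if prop != "@id":
        assert isinstance(value, datatypes[prop]), f"type error on {prop}"
```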

>> Read more about Atomic Data

Conzept encyclopedia — An alternative encyclopedia

The Conzept encyclopedia is an attempt to create an encyclopedia for the 21st century. A modern topic-exploration tool based on: Wikipedia, Wikidata, the Open Library, Archive.org, YouTube, the Global Biodiversity Information Facility and many other information sources. A semantic web app built for fun, education and research. Conzept allows you to explore any of the millions of topics on Wikipedia from many different angles - such as science, art, digital books and education - both as a defined semantic entity ("thing") as well as a string. In addition, client-side topic classification allows for a fast, higher-level logic throughout the whole user experience. Conzept also has a uniquely integrated user interface, which gives you a single well-designed view of all this information (in any of the 300+ Wikipedia languages), without cognitive overload.

>> Read more about Conzept encyclopedia

Dat Private Network — Private storage in DAT

The dat private network is a self-hosted server that is easy to deploy on cloud or home infrastructure. Key features include a web-based control panel for administration by non-developers, as well as on-disk encryption. These no-knowledge storage services will ensure backup and high availability of distributed datasets, while also providing trust that unauthorized third parties won’t have access to content.

By creating a turnkey backup solution, we’ll be able to address two of our users’ most pressing questions about dat: who serves my data when I’m offline, and how do I archive and secure important files? The idea for this module came from the community, and reflects a dire need in the storage space -- no-knowledge backup and sync across devices. A properly-designed backup service will provide solutions to both of these questions, and will do so in a privacy-preserving way.

This deliverable will put resources into bringing this work to a production-ready state, primarily through development towards updates that make use of the latest performance and security updates from the dat ecosystem, such as NOISE support. We plan to maintain the socio-technical infrastructure through an open working group that creates updates for the network as it matures.

>> Read more about Dat Private Network

DATALISP — Universal data interchange format using canonical S-expressions

As society goes digital, the need for thorough fundamentals becomes more prominent. Datalisp is a laboratory for decentralized collaboration built on a few well-understood ideas which imply a certain architecture. The central thesis of datalisp is: "If we agree to use a theoretically sound data interchange format then we will be able to efficiently express increasingly complicated coordination problems", but in order to move the web to a different encoding we will need incentives on our side. A substantial improvement in user experience is needed, and we aim to provide it. Ultimately our goal is to give peers the tools they need to protect themselves, and others, by collaboratively measuring the legitimacy of information and, locally, by assessing whether data can be trusted as code or whether it requires user attention. Datalisp is the convergence point for all these tools (none of which is named "datalisp") rather than a language; join us in figuring out how to reach it!

>> Read more about DATALISP

Encoding for Robust Immutable Storage (ERIS) — Encrypted and content-addressable data blocks

The Encoding for Robust Immutable Storage (ERIS) is an encoding of content into a set of uniformly sized, encrypted and content-addressed blocks, as well as a short identifier (a URN). The content can be reassembled from the encrypted blocks only with this identifier (the read capability). ERIS is a form of content-addressing: the identifier of some encoded content depends on the content itself and is independent of the physical location where the content is stored (unlike content addressed by URLs). This enables content to be replicated and cached, making systems relying on the content more robust.

Unlike other forms of content-addressing (e.g. IPFS), ERIS encrypts content into uniformly sized blocks for storage and transport. This allows peers without access to the read capability to transport and cache content without being able to read the content. ERIS is defined independent of any specific protocol or application and decouples content from transport and storage layers.
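
The principle can be shown in a few lines: split content into fixed-size blocks, encrypt each block, and address it by the hash of its ciphertext, so that the references plus the key together form the read capability. The Python sketch below is a toy illustration only; the actual ERIS encoding uses ChaCha20 and Blake2b and builds a tree of reference blocks rather than a flat list.

```python
# Toy illustration of content-addressed block encryption using only the
# standard library; not the real ERIS algorithm.
import hashlib
import secrets

BLOCK_SIZE = 1024  # ERIS specifies uniform block sizes (1 KiB or 32 KiB)

def encode(content: bytes, key: bytes):
    """Split content into encrypted blocks addressed by ciphertext hash."""
    store, refs = {}, []
    for i in range(0, len(content), BLOCK_SIZE):
        block = content[i:i + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
        # Toy stream cipher: XOR with a hash-derived keystream.
        stream = hashlib.shake_256(key + i.to_bytes(8, "big")).digest(BLOCK_SIZE)
        ciphertext = bytes(a ^ b for a, b in zip(block, stream))
        ref = hashlib.blake2b(ciphertext, digest_size=32).digest()
        store[ref] = ciphertext  # storage peers see only opaque, uniform blocks
        refs.append(ref)
    return store, refs  # refs + key together act as the read capability

store, refs = encode(b"hello, robust immutable storage", secrets.token_bytes(32))
print(len(store), all(len(c) == BLOCK_SIZE for c in store.values()))
```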

The project will release version 1.0.0 after handling feedback from a security audit, provide implementations in popular languages to facilitate wider usage (e.g. a C library, a JS library on NPM), perform a number of core integrations into various transport and storage layers (e.g. GNUNet, HTTP, CoAP, S3), and deliver Block Storage Management (quotas, garbage collection and synchronization for caching peers).

>> Read more about Encoding for Robust Immutable Storage (ERIS)

Etebase - protocol and encryption enhancements — Redesign EteSync protocol and encryption scheme

Etebase is an open-source and end-to-end encrypted software development kit and backend. Think of it as a tool that developers can use to easily build encrypted applications. Etebase is the new name for the protocol that powers EteSync, an open source, end-to-end encrypted, and privacy respecting sync solution for contacts, calendars, notes, tasks and more across all major platforms.

Many people are well aware of the importance of end-to-end encryption. This is evident from the increasing popularity of end-to-end encrypted messaging applications. However, in today's cloud-based world, there is much more (just as important!) information that is simply left exposed and unencrypted, without people even realising. Calendar events, tasks, personal notes and location data ("find my phone") are a few such examples. This is why the overarching goal of Etebase is to enable users to end-to-end encrypt all of their data.

While the Etebase protocol served EteSync well, there are a number of improvements that could be made to better support EteSync's current and long-term requirements, as well as enabling other developers to build a variety of encrypted applications.

>> Read more about Etebase - protocol and encryption enhancements

EteSync - iOS application — Encrypted synchronisation for calendars, address books, etc.

EteSync is an open source, end-to-end encrypted, and privacy respecting sync solution for contacts, calendars and tasks with more data types planned for the future. It's currently supported on Android, the desktop (using a DAV adapter layer) where it seamlessly integrates with existing apps, and on the web for easy access from everywhere.

Many people are well aware of the importance of end-to-end encryption. This is evident from the increasing popularity of end-to-end encrypted messaging applications. However, in today's cloud-based world, there is much more (just as important!) information that is simply left exposed and unencrypted, without people even realising. Calendar events, tasks, personal notes and location data ("find my phone") are a few such examples. This is why the overarching goal of EteSync is to enable users to end-to-end encrypt all of their data.

The purpose of this project is to create an EteSync iOS client which will seamlessly integrate with the rest of the system and let the many currently uncatered-for iOS users securely sync their data.

>> Read more about EteSync - iOS application

Explain — Deep search on open educational resources

The Explain project aims to bring open educational resources to the masses. Many disparate locations of learning material exist, but as of yet there isn’t a single place which combines these resources to make them easily discoverable for learners. Using a broad array of deep content metadata extraction techniques developed in conjunction with the Delft University of Technology, the Explain search engine indexes content from a wide variety of sources. With this search engine, learners can then discover the learning material they need through a fine-grained topic search or through uploading their own content (e.g. exams, rubrics, excerpts) for which they require additional educational resources. The project focuses on usability and discoverability of resources.

>> Read more about Explain

First Classify Documents — Categorise different types of official documents

The project summary for this project is not yet available. Please come back soon!

>> Read more about First Classify Documents

Geolexica reverse — Reverse Semantic Search and Ontology Discovery via Machine Learning

Ever forgotten a specific word but could describe its meaning? Internet search engines more often than not return unrelated entries. The solution is reverse semantic search: given an input of the meaning of the word (the search phrase), provide an output with dictionary words that match the meaning. The key to accurate reverse search lies in the machine’s ability to understand semantics. We employ deep learning approaches in natural language processing (NLP) to enable better comparison of meanings between search phrases and word definitions. Accuracy will be significantly increased. The project outcome will be employed on Geolexica as a pilot application and testbed for evaluation. The ability to identify entities with similar semantics facilitates ontology discovery in the Semantic Web and in Technical Language Processing (TLP).
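
The mechanism boils down to embedding both dictionary definitions and the user's description into a shared vector space and ranking by similarity. A minimal sketch follows; the embedding model and the three-entry glossary are placeholders, not Geolexica's actual pipeline.

```python
# Sketch of reverse semantic search: rank dictionary entries by how close
# their definitions are to the user's description of the forgotten word.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

glossary = {
    "ephemeral": "lasting for a very short time",
    "ubiquitous": "present or found everywhere",
    "serendipity": "finding something good without looking for it",
}
terms = list(glossary)
definition_vecs = model.encode([glossary[t] for t in terms], convert_to_tensor=True)

query = "a lucky discovery made by accident"
query_vec = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_vec, definition_vecs)[0]
best = max(range(len(terms)), key=lambda i: float(scores[i]))
print(terms[best])  # -> "serendipity"
```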

>> Read more about Geolexica reverse

In-document search — Interoperable Rich Text Changes for Search

There is a relatively unexplored layer of metadata inside the document formats we use, such as Office documents. This makes it possible to answer queries like: show me all the reports with edits made within a given timespan, by a certain user or by a group of users. Or: show me all the hyperlinks inside documents pointing to a web resource that is about to be moved. Or: list all presentations that contain this copyrighted image. Such embedded information could be better exposed to and used by search engines than is now the case. The project expands the ODF Toolkit library to dissect file formats, and will potentially have a very useful side effect of maturing the understanding of document metadata at large, and for collaborative editing of documents in particular.
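
As a taste of the metadata layer involved: an ODF document is a ZIP archive whose content.xml marks hyperlinks as text:a elements, so the outbound links of a document can be listed with the standard library alone. This sketch is independent of the ODF Toolkit the project actually extends.

```python
# List all hyperlinks inside an OpenDocument file: ODF documents are ZIP
# archives, and hyperlinks live in content.xml as <text:a xlink:href="...">.
import sys
import zipfile
import xml.etree.ElementTree as ET

TEXT_A = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}a"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def hyperlinks(path: str) -> list:
    with zipfile.ZipFile(path) as odf:
        root = ET.fromstring(odf.read("content.xml"))
    return [el.get(XLINK_HREF) for el in root.iter(TEXT_A)]

if __name__ == "__main__":
    for url in hyperlinks(sys.argv[1]):
        print(url)
```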

>> Read more about In-document search

Practical Tools to Build the Context Web — Declarative setup of P2P collaboration

In a nutshell, the Perspectives project makes collaboration behaviour reusable, and workflows searchable. It provides the conceptual building blocks for co-operation, laying the groundwork for a federated, fully distributed infrastructure that supports endless varieties of co-operation and reuse. The declarative Perspectives Language allows a model to be translated instantly into an application that lets multiple users contribute to a shared process, each with her own unique perspective.

The project will extend the existing Alpha version of the reference implementation into a solid Beta, with useful models/apps, aspiring to community adoption to further the growth of applications for citizen end users. Furthermore, necessary services such as a model repository will be provided. This will bring Perspectives out of the lab, and into the field. For modellers, it will provide support for the modelling language in well-known IDEs, with syntax colouring, go-to definition and autocomplete.

Real life is an endless affair of interlocking activities. Likewise, Perspectives models of services can overlap and build on common concepts, thus forming a federated conceptual space that allows users to move from one service to another as the need arises in a most natural way. Such an infrastructure functions as a map, promoting discovery and decreasing dependency on explicit search. However, rather than being an online information source to be searched, such as the traditional Yellow Pages, Perspectives models allow their users (individuals and organisations alike) to interact and deal with each other online. Supply-demand matching in specific domains (e.g. local transport) integrates readily with such an infrastructure. Other patterns of integrating search with co-operation support form a promising area for further research.

>> Read more about Practical Tools to Build the Context Web

LiberaForms — End-to-End Encrypted Forms

Cloud services that offer handling of online forms are widely used by schools, associations, volunteer organisations, civil society, and even families to publish questionnaires and collect the results. While these cloud services (such as Google Forms and Microsoft Forms) can be quite convenient to create forms with, for the constituency which has to fill out these forms such practices can actually be very invasive, because forms may not only include personal details such as name, address, gender or age, but also more intimate questions including medical details, political information and lifestyle background. In many situations there is a power asymmetry between the people creating the form and the users that have to supply the data through that form. Often there is significant time pressure. No wonder that users feel socially coerced to comply and hand over their data, even though they might be perfectly aware that their own data might be used against them.

LiberaForms is a transparent alternative to proprietary online forms that you can easily host yourself. In this project, LiberaForms will add end-to-end encryption with OpenPGP, meaning that the data is encrypted on the client device and only the final recipient of the form data can read it (and not just anyone with access to a server). Also, the team will add real-time collaboration on forms, in case users need to fill out forms together.
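
The end-to-end principle is that the response is encrypted to the form owner's public key before it leaves the client, so the server only ever stores ciphertext. A minimal sketch of that flow with the python-gnupg wrapper follows; LiberaForms itself encrypts in the browser, and the key, recipient and upload call here are placeholders.

```python
# Sketch of the OpenPGP flow: encrypt the form response to the owner's
# public key on the client; the server never sees plaintext.
import json
import gnupg

gpg = gnupg.GPG(gnupghome="/tmp/demo-keyring")  # assumes this keyring exists
# ... and that the owner's public key was imported earlier, e.g.:
# gpg.import_keys(open("owner-public-key.asc").read())

def submit_to_server(ciphertext: str) -> None:
    """Placeholder for the form platform's upload call."""
    print(ciphertext[:64], "...")

answers = {"name": "Jane Doe", "allergies": "penicillin"}
encrypted = gpg.encrypt(json.dumps(answers), ["owner@example.org"])
assert encrypted.ok, encrypted.status
submit_to_server(str(encrypted))  # only ciphertext leaves the device
```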

>> Read more about LiberaForms

LinkedDataHub — Framework to handle Linked Data at scale

LinkedDataHub is a Knowledge Graph explorer, or in technical terms, a rich Linked Data client combined with a personal RDF dataspace (triplestore). It provides a number of features for end-users: browsing Linked Data, cloning RDF resources to the personal dataspace, searching and querying SPARQL endpoints, creating collections from SPARQL queries, editing remote and local RDF documents, and creating and transcluding structured content with visualizations of SPARQL results, charts etc. LinkedDataHub is a standalone product as well as a framework – its data-driven architecture allows extension and customization at every level, from the APIs up to the UI.

We expect LinkedDataHub to become a go-to tool for end-users working with Linked Data and SPARQL: researchers, data scientists, domain experts – regardless of whether they work in the digital humanities, life-sciences or any other domain. We strive to provide an unparalleled Knowledge Graph user experience that is enabled by the RDF stack, with the focus on discovery, exploration and personalization.

>> Read more about LinkedDataHub

LumoSQL — Create more reliable, distributed embedded databases

The most widely-used database (SQLite) is not as reliable as it could be, and is missing essential features like encryption and safe usage in networked environments. Billions of people unknowingly depend on SQLite in their applications for critical tasks throughout the day, and this embedded database is used in many internet applications - including some core internet and technology infrastructure. This project wants to create a viable alternative ('rip and replace'), using the battle-tested LMDB produced by the LDAP community. This effort will also allow a number of other shortcomings to be addressed, and make many applications more trustworthy and, by adding cryptography, also more private. Given the wide range of use cases and heavy operational demands of this class of embedded databases, a serious effort is needed to execute this plan in a way where users can massively switch. The project will test extensively, and will validate its efforts with a number of critical applications.

>> Read more about LumoSQL

MaDada — Using LinkedData to improve FOI processes

MaDada is a free, open source platform that simplifies and opens up the process of access by the general public to data and information held by the French government. Making use of the Freedom of Information (FOI) law, the platform guides citizens in filing requests, but also acts as an open data archive and a platform for right-to-know and transparency campaigns, by publishing the whole process: the request history, the resulting correspondence, and the data obtained through it. Launched in October 2019 by Open Knowledge Foundation France members, MaDada has helped 250+ users make over 1200 FOI requests to French public bodies, and is beginning to play an important role in right-to-know, transparency and open government efforts.

MaDada is based on the open source software Alaveteli (https://alaveteli.org), which has been adapted and deployed in more than 25 countries, in 20 different languages and jurisdictions. Alaveteli offers efficient functions for users to file and manage FOI requests. The NLnet funding will help the project develop and improve discovery and search features for public bodies on madada.fr and in the Alaveteli software - for instance, in France alone there are more than 60,000 public authorities. This will take advantage of existing digital commons such as Wikidata, and open standards such as schema.org and DCAT.

>> Read more about MaDada

NEFUSI — NEFUSI: A novel NEuroFUzzy approach for semantic SImilarity assessment

The challenge of determining the degree of semantic similarity between two expressions of a textual nature has become increasingly important in recent times. Its great importance in many modern computing areas, combined with the latest advances in neural computation, has steadily improved the available solutions. NEFUSI (which stands for "NEuroFUzzy approach for semantic SImilarity assessment") aims to go a step further with the design and development of a novel neurofuzzy approach for semantic textual similarity based on neural networks and fuzzy logic. We intend to benefit from the outstanding capabilities of the latest neural models to work with text and, at the same time, from the possibilities that fuzzy logic offers to aggregate and decode numerical values in a personalized way. In this way, the project will build an approach intended to effectively determine the degree of semantic similarity of textual expressions with high accuracy in a wide range of scenarios concerning Search and Discovery.

>> Read more about NEFUSI

Nextcloud — Unified and intelligent search within private cloud data

The internet helps people to work, manage, share and access information and documents. Proprietary cloud services from large vendors like Microsoft, Google, Dropbox and others cannot offer the privacy and security guarantees users need. Nextcloud is a 100% open source solution where all information can stay on premise, with the protection users choose themselves. The Nextcloud Search project will solve the last remaining open issue: unified, convenient and intelligent search and discoverability of data. The goal is to build a powerful but user-friendly interface for search across the entire private cloud. It will be possible to filter by date, type, owner, size, keywords, tags and other metadata. The backend will offer indexing and searching of file-based content, as well as integrated search for other content like text chats, calendar entries, contacts, comments and other data. It will integrate with the private search capabilities of Searx. As a result, users will have the same powerful search functionality they know and like elsewhere, while respecting their privacy and strict regulations like the GDPR.

>> Read more about Nextcloud

OpenStreetMap Speed Limits — Infer default speed limits for better quality OpenStreetMap-based routing

OpenStreetMap (OSM) is the world's largest open geodata set, created and maintained collaboratively by millions of users. Of course there are many other purposes beyond creating a map, for instance finding the best route from A to B. Such usage needs to take into account incomplete data, as coverage of speed limits varies greatly across OSM. Currently, only about 12% of roads in OSM have speed limits set. However, default legal speed limits can often be inferred from other data, such as whether the road is within an urban zone, whether the carriageway is segregated, how many lanes it has, whether it is paved, etc.

The goal of this project is to extract the default speed limits for different road and vehicle types across national and state legislations, map these to OSM and provide them in a machine-readable form so that they can be consumed by open source routing software such as GraphHopper, Valhalla or OSRM. Furthermore, a reference implementation that interprets this data will be provided.
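
Conceptually, the deliverable is a machine-readable rule table plus an interpreter that matches a road's OSM tags against the rules of its jurisdiction. Below is a toy version; the rule format and values are simplified illustrations, not the project's actual schema.

```python
# Toy interpreter for default speed limits: match a road's OSM tags against
# per-country rules (first matching rule wins). Values are illustrative.
RULES = {
    "DE": [
        ({"highway": "motorway"}, None),  # no general limit
        ({"lit": "yes"}, 50),             # urban default
        ({}, 100),                        # rural default
    ],
    "FR": [
        ({"highway": "motorway"}, 130),
        ({}, 80),
    ],
}

def default_maxspeed(country: str, tags: dict):
    for condition, limit in RULES[country]:
        if all(tags.get(k) == v for k, v in condition.items()):
            return limit
    return None

print(default_maxspeed("DE", {"highway": "residential", "lit": "yes"}))  # 50
print(default_maxspeed("FR", {"highway": "motorway"}))                   # 130
```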

>> Read more about OpenStreetMap Speed Limits

Personal Food Facts — Privacy protecting personalized information about food

Open Food Facts is a collaborative database containing data on 1 million food products from around the world, as open data. This project will allow users of our website, mobile app and our 100+ mobile apps ecosystem to get personalized search results (food products that match their personal preferences and diet restrictions based on ingredients, allergens, nutritional quality, vegan and vegetarian products, kosher and halal foods etc.) without sacrificing their privacy or having to send those preferences to us.
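
The privacy approach is simply to evaluate the dietary profile on the device: download generic product data, filter locally, upload nothing. A schematic sketch follows; the field names loosely follow Open Food Facts records, and the matching logic is hypothetical.

```python
# Schematic client-side filter: the dietary profile never leaves the device;
# only generic product data is downloaded.
profile = {
    "avoid_allergens": {"en:peanuts", "en:gluten"},
    "require_labels": {"en:vegan"},
}

def acceptable(product: dict) -> bool:
    allergens = set(product.get("allergens_tags", []))
    labels = set(product.get("labels_tags", []))
    return (not allergens & profile["avoid_allergens"]
            and profile["require_labels"] <= labels)

products = [
    {"product_name": "Choc bar", "allergens_tags": ["en:peanuts"], "labels_tags": []},
    {"product_name": "Oat drink", "allergens_tags": [], "labels_tags": ["en:vegan"]},
]
print([p["product_name"] for p in products if acceptable(p)])  # ['Oat drink']
```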

>> Read more about Personal Food Facts

PRESC Classifier Copies Package — Implementing Machine Learning Copies as a Means for Black Box Model Evaluation and Remediation

The ubiquitous use over the Internet, and in particular in search engines, of often proprietary black-box machine learning models and APIs in the form of Machine Learning as a Service makes it very difficult to control and mitigate their potential harmful effects (such as lack of transparency, privacy safeguards, robustness, reusability or fairness). Machine Learning Classifier Copying allows us to build a new model that replicates the decision behaviour of an existing one without knowing its architecture or having access to the original training data. A suitable copy makes it possible to audit the already deployed model, mitigate its shortcomings, and even introduce improvements, without the need to build a new model from scratch, which would require access to the original data.

This project aims to implement this innovative technique as a practical solution within PRESC, an existing free software tool for the evaluation of machine learning classifiers, so that classifier copies are automated and can easily be created by developers using machine learning, in order to reuse, evaluate, mitigate and improve black-box models, build personal data privacy safeguards into their machine learning models, or for any other application.
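
In its simplest form, classifier copying samples synthetic points, labels them by querying the black box, and fits a transparent surrogate to those labels. Below is a bare-bones sketch with scikit-learn, standing in for PRESC's automated pipeline.

```python
# Bare-bones classifier copy: query a black box on synthetic samples and fit
# a transparent surrogate to its answers; no original training data needed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Pretend this is a deployed black box that we can only query.
X_orig, y_orig = make_classification(n_samples=500, n_features=4, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X_orig, y_orig)

# 1. Sample synthetic points covering the input domain (no access to X_orig).
rng = np.random.default_rng(0)
X_synth = rng.uniform(low=-4, high=4, size=(5000, 4))

# 2. Label them by querying the black box.
y_synth = black_box.predict(X_synth)

# 3. Fit an auditable copy to the black box's decisions.
copy = DecisionTreeClassifier(max_depth=5).fit(X_synth, y_synth)

# Agreement ("fidelity") between copy and original on fresh queries.
X_test = rng.uniform(low=-4, high=4, size=(1000, 4))
print((copy.predict(X_test) == black_box.predict(X_test)).mean())
```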

>> Read more about PRESC Classifier Copies Package

PeerDB Search — Search for semantic and full-text data

PeerDB Search is an opinionated but flexible open source search system incorporating best practices in search and user interfaces and experience to provide intuitive, fast, and easy to use search over both full-text data and semantic data exposed as facets. The goal of the user interface is to allow users without technical knowledge to easily find results they want, without having to write queries. The system will also allow multiple data sources to be used and merged together. As a demonstration PeerDB will deploy a public instance as a search service for Wikipedia articles and Wikidata data.

>> Read more about PeerDB Search

Poliscoops — Make political news and online debate accessible

PoliFLW is an interactive online platform that allows journalists and citizens to stay informed, and keep up to date with the growing group of political parties and politicians relevant to them - even those whose opinions they don't directly share. The prize-winning political crowdsourcing platform makes finding hyperlocal, national and European political news relevant to the individual far easier. By aggregating the news political parties share on their websites and social media accounts, PoliFLW is a time-saving and citizen-engagement-enhancing tool that brings the internet one step closer to being human-centric. In this project the platform will add the news shared by parties in the European Parliament and national parties in all EU member states, showcasing what it can mean for access to information in Europe. There will be a built-in translation function, making it easier to read news across country borders. PoliFLW is a collaborative environment that helps to create more societal dialogue and better informed citizens, breaking down political barriers.

>> Read more about Poliscoops

Private Searx — Add private resources to the open source Searx metasearch engine

Searx is a popular meta-search engine letting people query third party services to retrieve results without giving away personal data. However, there are other sources of information stored privately, either on the computers of users themselves or on other machines in the network that are not publicly accessible. To share it with others, one could upload the data to a third party hosting service. However, there are many cases in which it is unacceptable to do so, for privacy reasons (including the GDPR) or in the case of sensitive or classified information. This issue can be avoided by storing and indexing data on a local server. By adding offline and private engines to searx, users can search not only the internet, but also their local network, from the same user interface. Data can be conveniently available to anyone without giving it away to untrusted services. The new offline engines will let users search the local file system, open source indexers and databases, all from the searx UI.

>> Read more about Private Searx

PyCM — Evaluate the performance of ML algorithms

The outputs and results of machine learning algorithms are usually in the form of confusion matrices. PyCM is an open source Python library for evaluating, quantifying, and reporting the results of machine learning algorithms systematically. PyCM provides a wide range of confusion matrix evaluation metrics to process and evaluate the performance of machine learning algorithms comprehensively. This open source library allows users to compare different algorithms in order to determine the optimal one based on their preferences and priorities. In addition, the evaluation can be reported in different formats. PyCM has been widely used as a standard and reliable post-processing tool in reputable open source AI projects like TensorFlow Similarity, Google's SCAAML, torchbearer, and CLaF.
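
Typical usage is only a few lines: build a ConfusionMatrix from actual and predicted label vectors, then read the metrics off it. A small example based on PyCM's documented interface:

```python
from pycm import ConfusionMatrix

actual = [0, 1, 2, 2, 1, 0, 1, 2]
predicted = [0, 1, 2, 1, 1, 0, 2, 2]

cm = ConfusionMatrix(actual_vector=actual, predict_vector=predicted)
print(cm.Overall_ACC)  # overall accuracy
print(cm.F1)           # per-class F1 scores
cm.print_matrix()      # the confusion matrix itself
```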

>> Read more about PyCM

SCION-Pathdiscovery — Secure and reliable decentralized storage platform

With the amount of downloadable resources such as content and software updates available over the Internet increasing year over year, it turns out that not all content has someone willing to serve it up eternally, for free, for everyone. And in other cases, the resources concerned are not meant to be public, but do need to be available in a controlled environment. In such situations users and other stakeholders need to provide the necessary capacity and infrastructure themselves, in another, collective way.

This of course creates new challenges. Unlike a website you can follow a link to or find through a standard search engine, and which you typically only have to vet once for security and trustworthiness, the distributed nature of such a system makes it difficult for users to find the relevant information in a fast and trustworthy manner. One of the essential challenges of information management and retrieval in such a system is locating data items in a way that keeps communication complexity scalable and achieves high reliability even in the presence of adversaries. More specifically, if a provider has a particular data item to offer, where shall the information be stored such that a requester can easily find it? Moreover, if a user is interested in a particular piece of information, how do they discover it and how can they quickly find the actual location of the corresponding data item?

The project aims to develop a secure and reliable decentralized storage platform enabling fast and scalable content search and lookup going beyond existing approaches. The goal is to leverage the path-awareness features of the SCION Internet architecture to use network resources efficiently in order to achieve a low search and lookup delay while increasing the overall throughput. The challenge is to select suitable paths considering those performance requirements, and potentially combining them into a multi-path connection. To this end, we aim to design and implement optimal path selection and data placement strategies for a decentralized storage system.

>> Read more about SCION-Pathdiscovery

Geographic tagging of Routing and Forwarding — Geographic tagging and discovery of Internet Routing and Forwarding

SCION is the first clean-slate Internet architecture designed to provide route control, failure isolation, and explicit trust information for end-to-end communication. As a path-based architecture, SCION end-hosts learn about available network path segments and combine them into end-to-end paths, which are carried in packet headers. By design, SCION offers transparency to end hosts with respect to the path a packet travels through the network. This has numerous applications related to trust, compliance, and also privacy. With a better understanding of the geographic and legislative context of a path, users can for instance choose trustworthy paths that best protect their privacy, or avoid the need for privacy-intrusive and expensive CDNs by selecting resources closer to them. SCION is the first decentralised system to offer this kind of transparency and control to users of the network.

>> Read more about Geographic tagging of Routing and Forwarding

SES - SimplyEdit Spaces — SimplyEdit Spaces - collaborative presentations

SimplyPresent allows users to collaboratively create and deliver good-looking presentations using CRDTs through Hyper Hyper Space - another project supported by NGI Assure. SimplyPresent is itself built on top of the open source SimplyEdit tool, adding advanced user-friendly presentation features. SimplyPresent allows team members to live-edit a presentation and the presenter notes while the presentation is being given, and to control the presentation from any phone without complicated setup: all that is needed on the presenting system or with remote viewers is a URL which will sync through Hyper Hyper Space.

>> Read more about SES - SimplyEdit Spaces

SOLID Data Workers — Toolkit to ingest data into SOLID

Solid Data Workers is a toolkit to leverage the Solid platform (an open source project led by Tim Berners-Lee) into a viable, convenient, open and interoperable alternative to privacy-hungry data silos. The aim is to use Solid as a general purpose storage for all of the user's private information, giving it a linked-data meaning to enrich the personal graph and provide a first-class semantic web experience. The project involves a PHP and a NodeJS implementation of the "Data Workers" toolkit to ease the "semantification" of the data collected from external services (building SPARQL queries, metadata retrieval and storage, relationship creation...), sample software components to import existing data into the semantic graph and keep it synchronized with back-end sources (primarily: emails and calendars), and a proof-of-concept application to showcase the potential of the semantic web applied to personal linked data. As Solid may be self-hosted or hosted by third-party providers, Solid Data Workers may be attached to any of those instances and to different back-end services.

>> Read more about SOLID Data Workers

SWH package manager Data Ingestion — Add Package managers to Software Heritage

The project summary for this project is not yet available. Please come back soon!

>> Read more about SWH package manager Data Ingestion

Storing Efficiently Our Software Heritage — Faster retrieval within Software Heritage

Software Heritage (https://www.softwareheritage.org) is the single largest collection of software artifacts in existence. But how do you store this in a way that lets you find something fast enough, taking into account that these are billions of files with a huge spread in file sizes? "Storing Efficiently Our Software Heritage" will build a web service that provides APIs to efficiently store and retrieve the 10 billion small objects that today comprise the Software Heritage corpus. It will be the first implementation of the innovative object storage design drawn up in early 2021. It has the ability to ingest the SWH corpus in bulk: it makes building search indexes an order of magnitude faster, helps with mirroring, etc. The project is the first step towards a more ambitious and general-purpose undertaking to store, search and mirror hundreds of billions of small objects.

>> Read more about Storing Efficiently Our Software Heritage

SensifAI — AI driven image tagging

Billions of users manually upload their captured videos and images to cloud storage services such as Dropbox, Google Drive and Apple iCloud straight from their camera or phone. Their private pictures and video material are subsequently stored unprotected somewhere else on some remote computer, in many cases in another country with quite different legislation. Users depend on the tools from these service providers to browse their archives of often thousands and thousands of videos and photos in search of some specific image or video of interest. The direct result of this is continuous exposure to cyber threats like extortion and an intrinsic loss of privacy towards the service providers. There is a perfectly valid user-centric approach possible in dealing with such confidential materials, which is to encrypt everything before uploading anything to the internet. At that point the user may be a lot safer, but from then on would have a hard time locating any specific videos or images in their often very large collection. What if smart algorithms could describe the pictures for you, recognise who is in them, and let you store this information and use it to conveniently search and share? This project develops an open source smart-gallery app which uses machine learning to recognize and tag all visual material automatically - and on the device itself. After that, the user can do what she or he wants with the additional information and the original source material. They can save them to local storage, using the tags for easy search and navigation. Or offload the content to the internet in encrypted form, and use the descriptions and tags to navigate this remote content. Either option makes images and videos searchable while fully preserving user privacy.

>> Read more about SensifAI

Software Heritage — Collect, preserve and share the source code of all software ever written

Software Heritage is a non-profit, multi-stakeholder initiative with the stated goal to collect, preserve and share the source code of all software ever written, ensuring that current and future generations may discover its precious embedded knowledge. This ambitious mission requires proactively harvesting a myriad of source code hosting platforms over the internet, each with its own protocol, and coping with a variety of version control systems, each with its own data model. Amongst other things, this project will help ingest the content of over 250,000 open source software projects that use the Mercurial version control system and that will be removed from the Bitbucket code hosting platform in June 2020.

>> Read more about Software Heritage

Peer-to-Peer Access to Our Software Heritage — Access Software Heritage data via IPFS DHT

Peer-to-Peer Access to Our Software Heritage (SWH × IPFS) is a project aimed at supporting Software Heritage's mission to build a universal source code archive and preserve it for future generations, by leveraging IPFS's capabilities to share and replicate the archive in a decentralized, peer-to-peer manner. The project will build a bridge between the existing Software Heritage (SWH) API and the IPFS network to transparently serve native IPFS requests for SWH data. In the short term, this allows users of IPFS to form their own Content Distribution Network for SWH data. Longer term, we hope this will serve as a foundation for a decentralized network of copies that, together, ensure that the loss of no single repository, however large, results in the permanent destruction of any part of our heritage. The end product would be a perfect application of IPFS's tools and a step in the direction of a decentralized internet services infrastructure.

>> Read more about Peer-to-Peer Access to Our Software Heritage

Solid Application Interoperability

The project summary for this project is not yet available. Please come back soon!

>> Read more about Solid Application Interoperability

Secure User Interfaces (Spritely) — Usability of decentralised social media

Spritely is a project to advance the federated social network by adding richer communication and privacy/security features to the network. This particular sub-project aims to demonstrate how user interfaces can and should play an important role in user security. The core elements necessary for secure interaction are shown through a simple chat interface which integrates a contact list as an easy-to-use implementation of a "petname interface". Information from this contact list is integrated throughout the implementation in a way that helps reduce phishing risk, aids discovery of other users, and requires no centralized naming authority. As an additional benefit, this project will demonstrate some of the asynchronous network programming features of the Spritely development stack.

>> Read more about Secure User Interfaces (Spritely)

StreetComplete — Fix open geodata with OpenStreetMap

The project will make collecting data for OpenStreetMap easier and more efficient. OpenStreetMap is the best source of information for general purpose search engines that need geographic data about the locations and properties of various objects. The objects vary from cities and other settlements to shops, parks, roads, schools, railways, motorways, forests, beaches and so on. A search engine can use the data to answer queries such as "route to nearest wheelchair accessible greengrocer", "list of national parks near motorways" or "London weather". The full OpenStreetMap dataset is publicly available under an open license and already used for many purposes. Improving OSM increases the quality of services using open data rather than proprietary datasets kept as a trade secret by established companies.

>> Read more about StreetComplete

StreetComplete — Collaborative editing in OpenStreetMap

StreetComplete is a mobile app that makes it easy and fun to contribute to OpenStreetMap while out and about. OpenStreetMap is the largest open data community about maps, and the go-to source for free geographic data when doing a location-based search. This project focuses on making the collection of data to be used in a search more powerful and efficient. More specifically, the main goals are to add the possibility to collect more data with an easy interface and to add a new view in which it will be more efficient to complete and keep up-to-date certain types of data, such as house numbers or cycleways.

>> Read more about StreetComplete

StreetComplete UX — Improve usability of StreetComplete

OpenStreetMap is the best source of information for general purpose search engines that need geographic data about the locations and properties of various objects. The objects vary from cities and other settlements to shops, parks, roads, schools, railways, motorways, forests, beaches and so on. A search engine can use the data to answer queries such as "route to nearest wheelchair accessible greengrocer", "list of national parks near motorways" or "London weather". The full OpenStreetMap dataset is publicly available under an open license and already used for many purposes.

The project will make collecting open data for OpenStreetMap easier and more efficient, and lower the threshold for contribution by improving usability and accessibility. Any user should be able to help improve OpenStreetMap data, simply by downloading the app from F-Droid or the Google Play store and mapping as they walk.

>> Read more about StreetComplete UX

TypeCell — CRDT-based collaborative block-based editor

TypeCell aims to make software development more open, simple and accessible. TypeCell integrates a live-programming environment as a first-class citizen in an end-user block-based document editor, forming an open source application platform where users can instantly inspect, edit and collaborate on the software they’re using. TypeCell spans a number of different projects improving and building on top of Matrix, Yjs and Prosemirror to advance local-first, distributed and collaborative software for the web.

>> Read more about TypeCell

URL Frontier — Develop an API between web crawler and frontier

Discovering content on the web is possible thanks to web crawlers, and luckily there are many excellent open source solutions for this; however, most of them have their own way of storing and accessing the information about URLs. The aim of the URL Frontier project is to develop a crawler-neutral API for the operations that a web crawler performs when communicating with a URL frontier, e.g. get the next URLs to crawl, update the information about URLs already processed, change the crawl rate for a particular hostname, get the list of active hosts, get statistics, etcetera. It aims to serve a variety of open source web crawlers, such as StormCrawler, Heritrix and Apache Nutch.

The outcomes of the project are to design a gRPC schema, then provide a set of client stubs generated from the schema as well as a robust reference implementation and a validation suite to check that implementations behave as expected. The code and resources will be made available under the Apache License as a sub-project of crawler-commons, a community that focuses on sharing code between crawlers. One of the objectives of URL Frontier is to involve as many actors in the web crawling community as possible and get real users to give continuous feedback on our proposals.
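
The operations listed above translate naturally into a small service interface. The Python stub below merely illustrates the shape such an API could take; the method names are invented here, and the authoritative contract is the gRPC schema the project will publish.

```python
# Shape of a crawler-neutral frontier API, as an abstract Python interface.
from abc import ABC, abstractmethod

class URLFrontier(ABC):
    @abstractmethod
    def get_urls(self, max_urls: int, key: str = None) -> list:
        """Next URLs to fetch, optionally restricted to one host/queue."""

    @abstractmethod
    def put_urls(self, discovered: list) -> None:
        """Report fetched or newly discovered URLs with their metadata."""

    @abstractmethod
    def set_delay(self, key: str, delay_seconds: float) -> None:
        """Change the crawl rate for a particular hostname."""

    @abstractmethod
    def get_stats(self) -> dict:
        """Counts of queued, in-flight and completed URLs."""

# Any crawler (StormCrawler, Heritrix, Apache Nutch, ...) coded against this
# interface could swap frontier implementations without code changes.
```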

>> Read more about URL Frontier

variation graph (vgteam) — Privacy enhanced search within e.g. genome data sets

Vgteam is pioneering privacy-preserving variation graphs, which make it possible to capture complex models and aggregate data resources with formal guarantees about the privacy of the individual data sources from which they were constructed. Variation graphs relate collections of sequences together as walks through a graph. They are traditionally applied to genomic data, where they support the compression and query of very large collections of genomes.

But there are many types of sensitive data that can be represented in variation graph form, including geolocation trajectory data - the trajectories of individuals and vehicles through transportation networks. Epidemiologists can use a public database of personal movement trajectories to, for instance, do geophylogenetic modeling of a pandemic like SARS-CoV-2. The idea is that one cannot see individual movements, but rather the large-scale flows of people across space that are essential for understanding the likely places where an outbreak might spread. This is essential information for understanding, at the scientific and political level, how to best act in case of a pandemic, now and in the future.

The project will apply formal models of differential privacy to build variation graphs which do not leak information about the individuals whose data was used to construct them. For genomes, the techniques allow us to extend the traditional models to include phenotype and health information, maximizing their utility for biological research and clinical practice without risking the privacy of participants who shared their data to build them. For geolocation trajectory data, people can share data in the knowledge that their social graph is not exposed. The tools themselves are not limited to the above use cases, and open the doors to many other types of applications both online (web browsing histories, social media usage) and offline.
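
For flavour, the classic way to obtain such a guarantee on aggregate data is the Laplace mechanism: perturb each released count with noise scaled to sensitivity/epsilon. The sketch below shows that mechanism applied to per-edge traversal counts and is not vgteam's actual construction.

```python
# Flavour of the Laplace mechanism on aggregate graph counts: each edge's
# traversal count is released with noise scaled to sensitivity/epsilon,
# hiding any single individual's walk.
import numpy as np

rng = np.random.default_rng(0)

def private_counts(edge_counts: dict, epsilon: float, sensitivity: float = 1.0) -> dict:
    scale = sensitivity / epsilon
    return {edge: max(0.0, count + rng.laplace(0.0, scale))
            for edge, count in edge_counts.items()}

true_counts = {("a", "b"): 1042, ("b", "c"): 7, ("b", "d"): 998}
print(private_counts(true_counts, epsilon=0.5))
```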

>> Read more about variation graph (vgteam)

Independent captions and transcript augmentation — Speech-to-text integration for Waasabi

Waasabi is a highly customizable platform for self-hosted video streaming (live broadcast) events. It is provided as a flexible open source web framework that anyone can host and integrate directly into their existing website. By focusing on quick setup, ease of use and customizability Waasabi aims to lower the barrier of entry for hosting custom live streaming events on one's own website, side-stepping the cost, compromises and limitations stemming from using various "batteries-included" offerings, but also removing the hassle of having to build everything from scratch.

In this project the team seeks to integrate tools for transcript augmentation, augmented human captioning and automatic machine-generated captions using open source software based on machine learning and royalty-free training data and models. The primary use case is live captioning for live internet broadcasts (primarily video streaming). With such tools online event organizers will be able to create interactive transcripts and better live captions for their events anytime, anywhere - and without external dependencies.

>> Read more about Independent captions and transcript augmentation

WebXray Discovery — Expose tracking mechanism in search hubs

WebXray intends to build a filter extension for the popular and privacy-friendly meta-search engine Searx that will show users what third party trackers are used on the sites in their results pages. Full transparency of which tracker is operated by which company is provided to users, who will be able to filter out sites that use particular trackers. This filter tool will be built on the unique ownership database WebXray maintains of tracking companies that collect personal data of website visitors.

Mapping the ownership of tracking companies which sell behavioural profiles of individuals is critical for all privacy and trust-enhancing technologies. Considerable scrutiny is given to the large players who conduct third party tracking and advertising, whilst little scrutiny is given to the large numbers of smaller companies who collect and sell unknown volumes of personal data. Such collection is unsolicited, with invisible beneficiaries. The ease and speed of corporate registration provides the opportunity for data brokers to mitigate their liability when collecting data profiles. We must therefore establish a systematic database of data broker domain ownership.

The filter extension that will be the output of the project will make this ownership database visible and actionable to end users, and will allow crowdsourced data to be curated and added to the current database of ownership (which is already comprehensive, containing detailed information on more than 1,000 ad tech tracking domains).

>> Read more about WebXray Discovery

WikiRate Insights — Transforming WikiRate ESG Platform User Experience to Maximise Reliable Data Insights

For too long actionable data about the behavior of companies has been hidden behind the paywalls of commercial data providers. As a result only those with sufficient resources were able to advocate and shape improvements in corporate practice. Since launching in 2016, WikiRate.org has become the world’s largest open source registry of ESG (Environmental, Social, and Governance) data with nearly 1 million data points for over 55,000 companies. Through the open data platform anyone can systematically gather, analyze and discuss publicly available information on company practices, joining current debates on corporate responsibility and accountability.

By bringing this information together in one place, and making it accessible, comparable and free for all, we aim to provide society with the tools and evidence it needs to spur corporations to respond to the world's social and environmental challenges. Homing in on the usability of the platform, this project will tackle some of the most crucial barriers for users when it comes to gathering and extracting the data, whilst boosting reuse of the open source platform for other purposes.

>> Read more about WikiRate Insights

WikiRate Insights 2 — Dedicated text search architecture for environmental, social and corporate governance platform

The project summary for this project is not yet available. Please come back soon!

>> Read more about WikiRate Insights 2

openEngiadina — Platform for creating, publishing and using open local knowledge

OpenEngiadina is developing a platform for open local knowledge - a mashup between a semantic knowledge base (like Wikipedia) and a social network using the ActivityPub protocol. openEngiadina is being developed with small municipalities and local organizations in mind, and wants to explore the intersection of Linked Data and social networks - a 'semantic social network'.

openEngiadina started off as a platform for creating, publishing and using open local knowledge. The structured data allows for semantic queries and intelligent discovery of information. The ActivityPub protocol enables decentralized creation and federation of such structured data, so that local knowledge can be created by independent actors in a certain area (e.g. a music association publishes concert location and timing). The project aims to develop a backend enabling such a platform, research ideas for user interfaces, and strengthen the ties between the Linked Data and decentralized social networking communities.

>> Read more about openEngiadina

Privacy Preserving Disease Tracking — Research into contact tracing privacy

In case of a pandemic, it makes sense to share data to track the spread of a virus like SARS-CoV-2. However, that very same data, when gathered in a crude way, is potentially very invasive to privacy - and in politically less reliable environments it can be used to map out the social graph of individuals and severely threaten civil rights and the free press. Unless the whole process is transparent, people might not be easily convinced to collaborate.

The PPDT project is trying to build a privacy-preserving contact tracing mechanism that can notify users if they have come in contact with potentially infected people. This should happen in a way that is as privacy-preserving as possible. We want the following properties: users should be able to learn if they got in touch with infected parties, and ideally only that - unless they opt in to share more information. The organisations operating servers should not learn anything besides who is infected, ideally not even that. The project builds a portable library that can be used across different mobile platforms, and a server component to aggregate data and send it back to the participants.
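
Decentralised designs in this family (DP-3T being a well-known example) typically derive a chain of daily keys and broadcast short-lived ephemeral IDs computed from them, so that a positive test only requires publishing the day keys of the contagious window. A stripped-down sketch of such a key schedule, with labels and sizes invented for illustration:

```python
# Stripped-down sketch of a DP-3T-style key schedule: a one-way daily key
# ratchet plus short-lived ephemeral IDs derived from it.
import hashlib
import hmac
import secrets

def next_day_key(day_key: bytes) -> bytes:
    """Ratchet: tomorrow's key is a hash of today's, so it can't be reversed."""
    return hashlib.sha256(day_key).digest()

def ephemeral_ids(day_key: bytes, per_day: int = 96) -> list:
    """Short-lived broadcast IDs for one day, derived via HMAC."""
    prf = hmac.new(day_key, b"broadcast-key", hashlib.sha256).digest()
    return [hashlib.sha256(prf + i.to_bytes(2, "big")).digest()[:16]
            for i in range(per_day)]

day0 = secrets.token_bytes(32)
ids_today = ephemeral_ids(day0)
# On a positive test, publishing day0 lets other phones recompute ids_today
# (and, via the ratchet, the IDs of later days in the contagious window),
# but reveals nothing about days before day0.
print(len(ids_today), ids_today[0].hex())
```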

>> Read more about Privacy Preserving Disease Tracking