Grant
Theme fund: NGI0 Discovery
Period: 2020-06 — 2020-06

URL Frontier

Develop an API between web crawlers and the frontier

Discovering content on the web is possible thanks to web crawlers. Luckily, there are many excellent open source solutions for this; however, most of them have their own way of storing and accessing information about URLs. The aim of the URL Frontier project is to develop a crawler-neutral API for the operations that a web crawler performs when communicating with a web frontier, e.g. get the next URLs to crawl, update the information about URLs already processed, change the crawl rate for a particular hostname, get the list of active hosts, get statistics, etc. It aims to serve a variety of open source web crawlers, such as StormCrawler, Heritrix and Apache Nutch.
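To make the listed operations concrete, here is a minimal in-memory sketch in Python. It is not the project's actual API: the class name `InMemoryFrontier`, every method name, the status strings and the default delay are assumptions chosen for illustration, and `put_urls` is an extra operation added so the toy frontier can be filled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HostQueue:
    """Per-host state: URLs waiting to be crawled, known URLs, crawl delay."""
    pending: deque = field(default_factory=deque)
    known: dict = field(default_factory=dict)  # url -> status string
    delay_seconds: float = 1.0                 # assumed default, not from the project


class InMemoryFrontier:
    """Toy frontier; all names here are hypothetical, not the real schema."""

    def __init__(self):
        self._hosts = {}

    def put_urls(self, host, urls):
        """Add discovered URLs for a host; URLs already known are ignored."""
        queue = self._hosts.setdefault(host, HostQueue())
        for url in urls:
            if url not in queue.known:
                queue.known[url] = "DISCOVERED"
                queue.pending.append(url)

    def get_next_urls(self, host, max_urls=10):
        """Hand out up to max_urls pending URLs for a host, oldest first."""
        queue = self._hosts.get(host)
        if queue is None:
            return []
        batch = []
        while queue.pending and len(batch) < max_urls:
            batch.append(queue.pending.popleft())
        return batch

    def update_url(self, host, url, status):
        """Record the outcome for a URL that has been processed."""
        self._hosts.setdefault(host, HostQueue()).known[url] = status

    def set_crawl_delay(self, host, delay_seconds):
        """Change the crawl rate for one hostname."""
        self._hosts.setdefault(host, HostQueue()).delay_seconds = delay_seconds

    def list_active_hosts(self):
        """Hosts that still have URLs waiting to be crawled."""
        return sorted(h for h, q in self._hosts.items() if q.pending)

    def get_stats(self):
        """Simple counters across all hosts."""
        queues = self._hosts.values()
        return {
            "hosts": len(self._hosts),
            "pending": sum(len(q.pending) for q in queues),
            "known": sum(len(q.known) for q in queues),
        }
```

A crawler written against such an interface would call `get_next_urls`, fetch the pages, then report back through `update_url`, without knowing how the frontier stores its data. That separation is the point of a crawler-neutral API.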

The planned outcomes of the project are a gRPC schema, a set of client stubs generated from that schema, a robust reference implementation, and a validation suite to check that implementations behave as expected. The code and resources will be made available under the Apache License as a sub-project of crawler-commons, a community that focuses on sharing code between crawlers. One of the objectives of URL Frontier is to involve as many actors in the web crawling community as possible and to get real users to give continuous feedback on our proposals.
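The idea of a validation suite can be illustrated with a small Python sketch: one set of behavioural checks run against any implementation that offers the same operations. Everything here is hypothetical, including `validate_frontier`, the two toy frontiers, the method names `put` and `next_urls`, and the three checks themselves. A real suite would talk to implementations over gRPC, not in-process.

```python
class SetFrontier:
    """Toy frontier that remembers every URL it has seen."""

    def __init__(self):
        self._seen = set()
        self._pending = []

    def put(self, url):
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def next_urls(self, max_urls):
        # Split off the first max_urls entries and keep the rest pending.
        batch, self._pending = self._pending[:max_urls], self._pending[max_urls:]
        return batch


class NoDedupFrontier(SetFrontier):
    """Deliberately faulty variant: it queues duplicates."""

    def put(self, url):
        self._pending.append(url)


def validate_frontier(make_frontier):
    """Run the checks against fresh instances; return the names of failed checks."""
    failures = []

    def check(name, condition):
        if not condition:
            failures.append(name)

    frontier = make_frontier()
    frontier.put("https://example.org/a")
    frontier.put("https://example.org/a")  # same URL submitted twice
    frontier.put("https://example.org/b")
    first = frontier.next_urls(10)
    check("duplicate URLs are handed out only once",
          sorted(first) == ["https://example.org/a", "https://example.org/b"])
    check("handed-out URLs are not returned again",
          frontier.next_urls(10) == [])

    frontier = make_frontier()
    for i in range(5):
        frontier.put(f"https://example.org/page{i}")
    check("batch size limit is respected", len(frontier.next_urls(2)) == 2)

    return failures


print(validate_frontier(SetFrontier))      # no failed checks
print(validate_frontier(NoDedupFrontier))  # reports the duplicate check
```

Because the checks only use the shared operations, the same suite can be pointed at any implementation, which is what lets different frontiers be swapped behind one API.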

Why does this actually matter to end users?

Search and discovery are among the most important and essential use cases of the internet. When you are in school and need to give a presentation or write a paper, when you are looking for a job, trying to promote your business or looking for relevant commercial or public services, most of the time you will turn to the internet, and in particular to the search bar in your browser, to find answers. Searching for information and making sure your name, company or idea can be discovered is crucial for users, but they actually have little control over this. Search engines decide what results you see, how your website can be discovered and what information is logged about your searches. What filters and algorithms are used remains opaque to users. They can only follow the rules laid out for them, instead of deciding on their own what, where and how to find the information they are looking for.

Centralizing online search around just a few search engines creates a host of problems, ranging from user privacy and non-transparent filtering to misinformation and fake news. The algorithms of search engines can be misused to show millions of users incorrect and discrediting information and stories about the topic or person they were looking up. This is done to influence elections or to shape public opinion around specific topics, like refugees and climate change. The reach of these search engines (and of the social media networks that are exploited for the same goal) is enormous, and once a story goes viral, it is hard if not impossible to take it offline, let alone combat the misinformation with correct reports. At their core, search engines focus on a website's popularity when they filter search results, not on the accuracy of its information. All of this creates a perfect storm for fake news to spread incredibly quickly online.

Luckily, there are alternative web search solutions that provide a clearer and more neutral view of the world. Some of these are powered by essential open source components, such as a web crawler, which does what you would think it does: it crawls the web and copies pages for a search engine to process and index. This project will make it easier for search engines to use various web crawlers for specific purposes in a uniform way. This useful building block can help pave the way for a wider diversity of search engines to combat misinformation, echo chambers and monopolies.

Run by DigitalPebble


This project was funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322.