Calls:

Send in your ideas. Deadline October 1st, 2020.

 

URL Frontier

[URL Frontier]

Discovering content on the web is possible thanks to web crawlers, luckily there are many excellent open source solutions for this; however, most of them have their own way of storing and accessing the information about the URLs. The aim of the URL Frontier project is to develop a crawler-neutral API for the operations that a web crawler when communicating with a web frontier e.g. get the next URLs to crawl, update the information about URLs already processed, change the crawl rate for a particular hostname, get the list of active hosts, get statistics, etcetera. It aims to serve a variety of open source web crawlers, such as StormCrawler, Heritrix and Apache Nutch.

The outcomes of the project are to design a REST API with OpenAPI then provide a set of client APIs using Swagger Codegen as well as a robust reference implementation and a validation suite to check that implementations behave as expected. The code and resources will be made available under Apache License as a sub-project of crawler-commons, a community that focuses on sharing code between crawlers. One of the objectives of URL Frontier is to involve as many actors in the web crawling community as possible and get real users to give continuous feedback on our proposals.

Run by DigitalPebble

Logo NLnet: abstract logo of four people seen from above Logo NGI Zero: letterlogo shaped like a tag

This project was funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322. Applications are still open, you can apply today.