Theme fund: NGI0 Discovery
Start: 2021-12
End: 2022-10
URLFrontier provides a crawler-neutral API and service implementation for a crawl frontier, which can power various web crawlers regardless of their implementation language and scalability. The API defines the operations that a web crawler typically performs when communicating with a web frontier, e.g. get the next N URLs to crawl, update the information about URLs already processed, change the crawl rate for a particular hostname, get the list of active hosts, get stats, etc. The aim of this project is to turn what is currently a working piece of software (the result of an earlier grant from NGI Zero Discovery) into an enterprise-grade solution. The improvements will mainly concern the service implementation, e.g. monitoring/reporting, clustering/discovery and robustness/resilience. The project will improve the usability of the system by adding configurable logging and metrics reporting, improve the performance of the service for very large volumes of data by adding efficient parallelization across multiple nodes, and improve the overall robustness through more graceful failure modes and more efficient restarts.
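
To make the operations listed above more concrete, here is a minimal in-memory sketch of a crawl frontier in Python. It is an illustration only and not the URLFrontier API or service: the class `InMemoryFrontier` and its methods (`put_url`, `get_urls`, `mark_done`, `set_delay`, `active_hosts`, `stats`) are hypothetical names chosen for this example, and the one-queue-per-hostname design with a fixed per-host delay is an assumption.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit


class InMemoryFrontier:
    """Toy crawl frontier (hypothetical): one FIFO queue per hostname."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.queues = defaultdict(deque)  # hostname -> URLs waiting to be fetched
        self.delays = {}                  # hostname -> min seconds between fetches
        self.next_allowed = {}            # hostname -> earliest time of next fetch
        self.known = set()                # every URL ever submitted
        self.completed = 0                # number of URLs reported as processed

    def put_url(self, url):
        """Register a discovered URL; return False if it was already known."""
        if url in self.known:
            return False
        self.known.add(url)
        self.queues[urlsplit(url).hostname].append(url)
        return True

    def get_urls(self, max_urls, now=None):
        """Return up to max_urls URLs, at most one per host that is due."""
        now = time.monotonic() if now is None else now
        batch = []
        for host, queue in self.queues.items():
            if len(batch) >= max_urls:
                break
            if queue and self.next_allowed.get(host, float("-inf")) <= now:
                batch.append(queue.popleft())
                delay = self.delays.get(host, self.default_delay)
                self.next_allowed[host] = now + delay
        return batch

    def mark_done(self, url):
        """Record that the crawler has finished processing a URL."""
        self.completed += 1

    def set_delay(self, hostname, seconds):
        """Change the crawl rate for one hostname (applies to later fetches)."""
        self.delays[hostname] = seconds

    def active_hosts(self):
        """List hostnames that still have URLs queued."""
        return sorted(host for host, queue in self.queues.items() if queue)

    def stats(self):
        """Return simple counters about the frontier's state."""
        queued = sum(len(queue) for queue in self.queues.values())
        return {"known": len(self.known), "queued": queued,
                "completed": self.completed}
```

A crawler would typically call `get_urls` in a loop, fetch each returned URL, feed newly discovered links back through `put_url`, and report completion with `mark_done`. A real frontier service would additionally need persistence, scheduling of refetches and distribution across nodes, none of which this sketch attempts.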

Run by DigitalPebble Ltd

This project was funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322.