WebXray Discovery

Expose tracking mechanisms in search hubs

WebXray intends to build a filter extension for the popular and privacy-friendly meta-search engine Searx that will show users which third-party trackers are used on the sites in their results pages. Users get full transparency about which tracker is operated by which company, and will be able to filter out sites that use particular trackers. This filter tool will be built on the unique ownership database WebXray maintains of tracking companies that collect personal data of website visitors.

Mapping the ownership of tracking companies which sell behavioural profiles of individuals is critical for all privacy and trust-enhancing technologies. Considerable scrutiny is given to the large players who conduct third-party tracking and advertising, whilst little scrutiny is given to the large numbers of smaller companies who collect and sell unknown volumes of personal data. Such collection is unsolicited, with invisible beneficiaries. The ease and speed of corporate registration provides the opportunity for data brokers to mitigate their liability when collecting data profiles. We must therefore establish a systematic database of data broker domain ownership.

The filter extension that will be the output of the project will make this ownership database visible and actionable to end users. The project will also curate crowdsourced data and add it to the current database of ownership (which is already comprehensive, containing detailed information on more than 1,000 ad tech tracking domains).

The project's own website: https://webXray.org

Why does this actually matter to end users?

It is a scenario you probably run through several times a day on autopilot, without even noticing you are doing it: you are looking for some specific information, and submit some related keywords to an online search engine. The search provider gets a large list of results, applies a set of ranking algorithms to it (pushing back potentially millions of results in favour of a handful of things it decides to push forward), and you are given a single webpage that holds a shortlist of results together with some ads. Each of these results has a short description and a link to visit the page. A quick glance tells you that a couple of these results seem relevant to what you are looking for.

Normally, you would just click, find what you need or browse around. But what do you actually know about these sites, other than that they contain the same words that you were looking for? Once you've decided to click on a link, you have to reveal yourself to a significant extent to the website operator or its providers - but also to anyone they allow to be present on the website you visit. Your browser by default sends all kinds of fingerprintable information, some of which is unavoidable without active intervention and can be extremely telling (like one's current IP address). This can be combined with the context of your visit: medical problem A, gossip B, professional interest C.

Many crucial internet and web standards were not designed with user privacy in mind, let alone giving users any sense of control over who can see what they do online. This opportunity for evil has been seized by all sorts of tracking and tracing schemes that make detailed profiles of people, which can then be (mis)used for commercial or even criminal gain. Another thing to note is that trackers get to run software on your computer. So the minute you enter a website, you automatically start downloading all kinds of potentially risky things from around the internet, including payloads from other sites that you never actively chose to interact with. These downloads often include known attack vectors like JavaScript, which (unless you actively take precautions) is even executed automatically.

Once you are on the web page you could probably search for and read the Terms of Service of the site. This may inform you that these third parties exist and are feasting away on your data, and that each has its own separate Terms of Service you could start looking for yourself. Note also that some sites contain dozens or even hundreds of trackers, which they combine with your context - and depending on what you were looking for, this can be quite telling. So at that point it is already too late.

Through regulations like the GDPR you may have the right to request what information is being captured from you, but in practical terms this is infeasible for normal people to do for every web page they come across while looking for something. You just wanted to quickly search for something, remember. You can perhaps accept a company doing some analytics for its own purposes. You did not ask for an exponential exposure to a swarm of tracking companies that sell your data to the internet. In other words, you unknowingly opened a can of worms.

What you really want to know is: are the pages you are about to visit stuffed from top to bottom with hidden trackers, each of them with an unhealthy interest in as much of your online behaviour as you are unable to shield off? Who is actually behind these domain names? A single tracker company may have different web domains it hides under, and these can be changed within minutes. The data they collect remains piled up on a single mountain of observations, though. Who are the companies behind these shady business practices that index detailed information, where are the owners located and what law are they subject to?

How come we can look through billions of pages of content in fractions of a second looking for any combination of words, but get no clue about what privacy or security we can expect there? And this even though it is rather obvious when a site allows other domains to take a peek at its visitors? A major step towards taking back control of our online presence is to map out how our privacy is violated on a website-by-website basis, and by whom. If you have ever used a tracker blocking app or browser extension, you will have seen tens or hundreds of unrecognizable titles and unfamiliar organization names. What if we can show you which of these 'parasitic' actors are where?

This is what the WebXray Search project aims to do. It continuously runs across the web looking not for content, but for trackers. And it will make this information directly visible to you as you search, before harm is done. If your current search engine operates trackers itself, it may not have a business interest in deploying this. This is why WebXray Search will be made available through an extension for the privacy-friendly and customizable Searx meta-search engine. This is a privacy-enhancing search proxy you can install yourself and share with others. You get the combined search results from many different sources, and for each of these results you can see what kind of tracking situation you will experience. And you will even be able to automatically block results that feature the worst offenders. These ethical filters help inform and protect users and their privacy, because wholly avoiding trackers is even better than using a tracker blocker.
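
The idea of annotating and filtering results can be sketched in a few lines of Python (the language Searx is written in). This is only an illustration under stated assumptions, not the project's actual extension or the Searx plugin API: the names Result, annotate and filter_results, the two lookup tables and all domains and company names in them are hypothetical stand-ins for the crawl data and the ownership database.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Hypothetical sample data, for illustration only. In the real project
# this information would come from the crawl and the ownership database.
TRACKERS_BY_SITE = {
    "news.example": ["ads.tracker-a.example", "pixel.tracker-b.example"],
    "blog.example": [],
    "shop.example": ["cdn.tracker-c.example"],
}
OWNER_BY_TRACKER = {
    "ads.tracker-a.example": "Company A",
    "pixel.tracker-b.example": "Company A",  # same owner, different domain
    "cdn.tracker-c.example": "Company C",
}


@dataclass
class Result:
    url: str
    title: str
    owners: set = field(default_factory=set)


def annotate(results):
    """Attach to each result the owners of the trackers known for its site."""
    for result in results:
        host = urlparse(result.url).hostname or ""
        trackers = TRACKERS_BY_SITE.get(host, [])
        result.owners = {
            OWNER_BY_TRACKER.get(tracker, "unknown owner") for tracker in trackers
        }
    return results


def filter_results(results, blocked_owners):
    """Keep only results whose sites use no tracker from a blocked owner."""
    blocked = set(blocked_owners)
    return [result for result in results if not (result.owners & blocked)]


results = annotate([
    Result("https://news.example/story", "Story"),
    Result("https://blog.example/post", "Post"),
    Result("https://shop.example/item", "Item"),
])
visible = filter_results(results, {"Company C"})
for result in visible:
    print(result.title, sorted(result.owners))
```

Grouping trackers by owner rather than by domain is the point of the sketch: two different tracker domains on the same page collapse into a single company, and blocking that company keeps working even if it switches to another of its domains.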

In addition to being a very practical solution for the general audience, the results of the project will also be useful to institutions that enforce privacy legislation like the GDPR. These organisations will be able to visually check which organisations operate outside the law. Journalists or NGOs that research personal data collection can use the tool for their studies. Let's halt unsolicited and unlawful tracking and profiling, so people can just enjoy the web again without too much fear for their privacy.

Run by WebXray

This project was funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322.
