Theme fund: NGI0 Discovery
Start: 2019-08
End: 2022-09

variation graph (vgteam)

Privacy enhanced search within e.g. genome data sets

Vgteam is pioneering privacy-preserving variation graphs, which make it possible to capture complex models and aggregate data resources with formal guarantees about the privacy of the individual data sources from which they were constructed. Variation graphs relate collections of sequences together as walks through a graph. They are traditionally applied to genomic data, where they support the compression and querying of very large collections of genomes.
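To make the idea of "sequences as walks through a graph" concrete, here is a minimal sketch in Python. It is an illustration only, not the vgteam data model or API: the class `VariationGraph`, its fields and the sample names are hypothetical, and real variation graphs also handle strand orientation, which is left out here.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    """Toy variation graph (hypothetical, for illustration only)."""
    nodes: dict = field(default_factory=dict)   # node id -> sequence fragment
    edges: set = field(default_factory=set)     # (from id, to id)
    paths: dict = field(default_factory=dict)   # path name -> list of node ids

    def add_path(self, name, walk):
        """Embed one sequence as a walk, adding any edges it traverses."""
        for node_id in walk:
            if node_id not in self.nodes:
                raise KeyError(f"unknown node {node_id}")
        self.edges.update(zip(walk, walk[1:]))
        self.paths[name] = list(walk)

    def sequence(self, name):
        """Spell out the sequence a walk represents."""
        return "".join(self.nodes[node_id] for node_id in self.paths[name])


# Two sequences that differ in one position share nodes 1 and 4.
graph = VariationGraph(nodes={1: "GAT", 2: "T", 3: "C", 4: "ACA"})
graph.add_path("sample_a", [1, 2, 4])
graph.add_path("sample_b", [1, 3, 4])

print(graph.sequence("sample_a"))  # GATTACA
print(graph.sequence("sample_b"))  # GATCACA
```

The shared nodes are stored once, which is where the compression comes from, while each contributor remains recoverable as a named walk. That last property is exactly what a privacy-preserving variant has to address.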

But many types of sensitive data can be represented in variation graph form, including geolocation trajectory data - the trajectories of individuals and vehicles through transportation networks. Epidemiologists can use a public database of personal movement trajectories to, for instance, do geophylogenetic modeling of a pandemic like SARS-CoV2. The idea is that one cannot see individual movements, only the large-scale flows of people across space that are essential for understanding where an outbreak is likely to spread. This information is essential for understanding, at both the scientific and the political level, how best to act in case of a pandemic, now and in the future.

The project will apply formal models of differential privacy to build variation graphs which do not leak information about the individuals whose data was used to construct them. For genomes, the techniques allow the traditional models to be extended to include phenotype and health information, maximizing their utility for biological research and clinical practice without risking the privacy of participants who shared their data to build them. For geolocation trajectory data, people can share data in the knowledge that their social graph is not exposed. The tools themselves are not limited to the above use cases, and open the door to many other types of applications, both online (web browsing histories, social media usage) and offline.
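The project description does not say which differential privacy mechanism is used. As a rough illustration of the general idea, the sketch below applies the textbook Laplace mechanism to edge traversal counts of a set of walks. Everything in it is an assumption made for the example: the function names, the public list of candidate edges (for instance a road network), and the cap `max_edges` on how many counts one individual may influence.

```python
import random
from collections import Counter


def bounded_edges(walk, max_edges):
    """Distinct edges of one walk, truncated so that a single
    individual changes at most max_edges counts, each by 1."""
    seen = []
    for edge in zip(walk, walk[1:]):
        if edge not in seen:
            seen.append(edge)
    return seen[:max_edges]


def laplace(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_edge_counts(walks, public_edges, epsilon, max_edges, rng):
    """Release a noisy count for every edge in a public candidate set.

    Noise is added to all candidate edges, including unused ones, so the
    set of released keys itself reveals nothing about the input walks.
    """
    counts = Counter()
    for walk in walks:
        counts.update(e for e in bounded_edges(walk, max_edges)
                      if e in public_edges)
    scale = max_edges / epsilon  # L1 sensitivity divided by epsilon
    return {edge: counts[edge] + laplace(scale, rng)
            for edge in sorted(public_edges)}


# Assumption: a real deployment would need a cryptographically secure
# noise source and care with floating-point effects; this is a sketch.
rng = random.Random()
public_edges = {(1, 2), (1, 3), (2, 4), (3, 4)}
walks = [[1, 2, 4], [1, 3, 4], [1, 2, 4]]
released = noisy_edge_counts(walks, public_edges,
                             epsilon=1.0, max_edges=2, rng=rng)
for edge, value in released.items():
    print(edge, round(value, 2))
```

Smaller values of `epsilon` give stronger privacy at the cost of noisier counts. With many contributors the aggregate flows stay visible while any single walk is masked, which matches the "large scale flows, not individual movements" goal described above.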

Why does this actually matter to end users?

Worries over our health and safety will in many cases take precedence over the perceived value of our privacy. When it comes to our physical health and well-being, we are often in a strongly dependent position. Especially in times of great mental stress (like when a medical doctor breaks bad news to us) or fear (my daughter is late from school, a deadly virus is going round) we often lack the time and knowledge to really consider what data we actually want to make available and under which conditions. Many people in such situations reach a point of detachment and panic, where they hand over whatever data is requested of them by whoever promises to resolve the stress. And once data is out there, it is hard to trace and take back.

But what if we do not have to give up our privacy for the sake of better and more personalized health care, fighting the spread of a pandemic, or other safety measures? What if we can have both? The classic example is genetic research, which can be extremely effective in identifying hereditary diseases and ultimately creating a type of personal health care that perfectly fits your unique needs. It also involves extremely personal and uniquely identifying data (literally the DNA that made us the individuals we are), the wider availability of which has a potential impact on the privacy and physical security of your children, their children, and the generations after them. Who knows what future generations will have to endure, in good times and in bad times? With the technology easily available to them, would insurance companies, employers or governments be tempted to test for as yet undiscovered heart conditions or expensive and rare diseases - or worse? And yet we make important decisions about this in times of stress.

The same caution should apply to a pandemic situation like SARS-CoV2. We all want a solution to help those most vulnerable, but as a society we are not prepared at all for the large security implications of exposing geolocation trajectory data seized from, for instance, telecom networks. And we cannot assume that this lack of preparation will not be abused. Yet we do want a deep, fine-grained understanding of how a virus actually spreads from person to person, which means we need insight into how people move around, based on actual data. For that purpose epidemiologists really could use access to a public database of personal movement trajectories, so they can do so-called geophylogenetic modeling. SARS-CoV2 is not the first virus to cause a pandemic, and it will not be the last - and policy measures like a lockdown have an immense cost in terms of our economy and societal disruption. So we had better test our assumptions and create data sets with privacy-preserving variation graphs that allow a wide community of researchers access without risk of security fallout afterwards.

Things do not have to be black and white. Doctors do not need to have access to all of our DNA in order to help us, so we don't have to share everything. Epidemiologists do not need to know that Gabriel visited Mary, or how many times Mary met Elisabeth and when and where. As it turns out, there are clever ways to aggregate data in a privacy-preserving way, preserving the characteristics needed and removing the rest. This project will build on these so-called "variation graphs" to further explore and develop these technologies. There are applications in many other use cases as well - variation graphs can be used to produce privacy-preserving representations of collections of other sensitive data, including collections of personal writing, web browsing histories, or even quantified-self data. The general tenet is always to share only the relevant information, while preventing the identification of individuals.

Variation graphs have huge potential. The project is contributing to several very ambitious goals, such as assisting with the SARS-CoV2 situation and enabling the creation of searchable DNA databases that provably protect the individuals who contribute to them. Input data from healthy and non-healthy people, from sinners and saints, can be transformed in such a way that the privacy of all involved is protected while intensive study of DNA data or human movement patterns remains possible. This will greatly help to convince people that they can contribute to the associated research. Of course, success in such a critical application breaks the ice for all other use cases where we see the benefit of big data but also the threats. Such a solution, if it becomes widely available, might be nothing short of revolutionary.

This project was funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322.