A data release is not possible, but if people want to come and run experiments on the data or test it for privacy, we are more than happy to host them. There is no formal process; it is best effort, and we have done it several times in the past. If you are interested, contact us and we will see if we can accommodate you.
(Disclaimer: I work at Cliqz.) Extending on that, let me elaborate on why we cannot open the data, not even a subset of it. We have had this discussion in the past, but for two reasons it is not an option.
Although it is anonymous data - currently we are not aware of any de-anonymization attacks - it is still data that came from real persons. We have a responsibility: once the data is out, we have to guarantee that no one will ever be able to identify a single person in the data. Also take into account that attackers can combine multiple data sets (background knowledge attacks); that even includes data sets that will be published (or leaked) in the future.
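To illustrate why such combinations are dangerous, here is a toy sketch of a linkage attack. The fields and records are entirely made up and have nothing to do with Human Web's actual schema; they only show the mechanics of joining an "anonymous" release against an auxiliary dataset on quasi-identifiers.

    # A toy sketch of a background knowledge (linkage) attack. All fields
    # and records are invented and unrelated to Human Web's actual schema;
    # they only demonstrate the mechanics of the join.

    # Hypothetical "anonymous" release: no names, but quasi-identifiers remain.
    anonymized = [
        {"zip": "80331", "birth_year": 1984, "sex": "F", "query": "rare disease"},
        {"zip": "10115", "birth_year": 1990, "sex": "M", "query": "divorce lawyer"},
    ]

    # Hypothetical auxiliary dataset (public, leaked, or published later).
    auxiliary = [
        {"name": "Alice", "zip": "80331", "birth_year": 1984, "sex": "F"},
        {"name": "Bob",   "zip": "10115", "birth_year": 1990, "sex": "M"},
    ]

    # Joining on the quasi-identifiers links records back to real persons.
    index = {(p["zip"], p["birth_year"], p["sex"]): p["name"] for p in auxiliary}
    for record in anonymized:
        key = (record["zip"], record["birth_year"], record["sex"])
        if key in index:
            print(f'{index[key]} searched for {record["query"]!r}')

Neither dataset alone reveals anything; the join does. This is exactly why future, not-yet-existing datasets have to be part of the threat model.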
You should never be too confident when it comes to security, nor should you underestimate the creativity of attackers. What we can do - and have done in the past - is simulate the scenario in a controlled environment by hiring pen-testing companies. If they find an attack, they will not use that knowledge to harm the persons behind the identities they could reveal.
That is the main reason. We don't want to end up in the same situation as AOL or Netflix when they published their data. Incidentally, Netflix is an example of a background knowledge attack, where the attackers needed to combine data sources.
There is also another argument. Skeptics will most likely remain skeptics, as we cannot prove that we did not filter out data before publishing. In other words, there is nothing for us to gain; we can only lose. Trust is important, but for building trust it is better to be transparent about the data that gets sent from the client. You can verify that part yourself and do not have to rely on trust alone. That is the core idea behind our privacy-by-design approach.
Those are the arguments that I'm aware of for why we will not open the data. However, getting access in controlled environments is possible. If you are doing security/privacy research, you can reach out to us. In my opinion, having more people try to find flaws in our heuristics is useful, as it gives us a chance to fix them before they can be used for attacks.
One notable exception: https://whotracks.me is built from Human Web, and all of its underlying data can be freely downloaded. We know it has already been used for research.
You are on point. Recency is a challenging problem for search engines in multiple ways. It is not limited to discovering new content: how does one index it? And how does one balance ranking when, for the same query, there are "very new", "new", "slightly old" and "really old" results? This applies both to news and to new webpages surfacing on the web.
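For a rough illustration of the ranking side of this, one common approach in the literature is to blend a topical relevance score with an exponential freshness decay, where the blend weight depends on how time-sensitive the query is. The sketch below is exactly that, an illustration; it is not a description of how Cliqz ranking actually works.

    import time

    # A toy sketch of blending topical relevance with an exponential
    # freshness decay. Purely illustrative; not Cliqz's actual ranking.

    def freshness(published_at: float, half_life_days: float = 7.0) -> float:
        """Decays from 1.0 towards 0.0 as the document ages."""
        age_days = (time.time() - published_at) / 86400
        return 0.5 ** (age_days / half_life_days)

    def blended_score(relevance: float, published_at: float,
                      recency_weight: float) -> float:
        # recency_weight would itself depend on the query: high for
        # news-like queries, near zero for evergreen ones.
        return ((1 - recency_weight) * relevance
                + recency_weight * freshness(published_at))

The hard part hidden in this sketch is choosing recency_weight per query: get it wrong and you either bury breaking news or let stale evergreen pages be displaced by low-quality new ones.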
On top of this, we have to remember that this is a fully autonomous, real-time system, which requires solving some of the most difficult engineering challenges at scale while staying mindful of latency and quality constraints.
At the end of the day, it is all about the user experience we ship, and we are very mindful of that. We will be publishing more details about Cliqz search on our blog, https://0x65.dev/, in the coming days, so stay tuned.
Exactly! It is high time for everyone to have an open discussion, and to settle on strict rules around what kind of data collection is considered fair. Search, being a complex application (as discussed in previous articles on our blog, https://0x65.dev/), requires data, and by that I mean real quality data. When dealing with large datasets, cutting through all the noise is probably a harder problem than search itself. This is where Human Web shines for us.
Consider this realistic scenario: for a user query, we have to navigate an index of billions of webpages and come up with at least 10 relevant results (which fill up page one). On top of that, we must make sure that the result in first position is the most relevant, or "vital" (a very broad and subjective metric), all of this in a fraction of a second. Add to this the complexity of dealing with the ever-evolving, exploding content on the web, with many languages and region locales to handle.
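For a flavor of how such a latency budget is typically met, consider the classic cascade design: a cheap retrieval stage narrows billions of documents down to a few hundred candidates, and only those are scored by an expensive ranker. The sketch below uses invented helper functions and is purely illustrative, not a description of Cliqz's pipeline.

    # A toy sketch of two-stage (cascade) ranking: cheap retrieval narrows
    # a huge index to a few hundred candidates, and only those are scored
    # by an expensive ranker. All names here are hypothetical.

    def cheap_retrieve(query: str, index: dict, limit: int = 500) -> list:
        """Stage 1: stand-in for an inverted-index lookup."""
        terms = set(query.lower().split())
        hits = [doc for doc in index.values()
                if terms & set(doc["text"].lower().split())]
        return hits[:limit]

    def expensive_score(query: str, doc: dict) -> float:
        """Stage 2: stand-in for a costly ranking model."""
        return sum(doc["text"].lower().count(t) for t in query.lower().split())

    def search(query: str, index: dict, k: int = 10) -> list:
        candidates = cheap_retrieve(query, index)
        candidates.sort(key=lambda d: expensive_score(query, d), reverse=True)
        return candidates[:k]

The point of the cascade is that the expensive model never sees more than a few hundred documents, which is what makes sub-second responses over a billions-scale index feasible at all.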
Through our blog articles, we want to re-emphasize that building an independent search engine is a very complex and challenging task, especially when we incorporate the constraints encountered at our scale and the fact that we truly want to abide by the principles of privacy by design.
Like most, we too could assume that data point X is readily available, so let's just collect it! In reality, however, we always circle back and literally ask ourselves: does collecting data point X violate our users' privacy? If the answer is ever "yes", the data point is dropped immediately and we DO NOT COLLECT IT! For us, features which rely on that data point are simply not deemed worth it. If the industry started to replicate this practice, it would be a great win for Cliqz.
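To make that rule concrete, here is a minimal sketch of what such a gate could look like on the client, before anything is sent. The specific checks are invented examples for illustration, not Human Web's actual heuristics; the point is only that the filtering runs client-side, where anyone can inspect it.

    import re

    # A toy sketch of a client-side "drop before send" gate. The checks
    # below are invented examples, not Human Web's actual heuristics.

    EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

    def violates_privacy(datapoint: dict) -> bool:
        url = datapoint.get("url", "")
        if EMAIL.search(url):                     # e.g. an email leaked into a URL
            return True
        if "token=" in url or "session" in url:  # session-like parameters
            return True
        return False

    def maybe_send(datapoint: dict, send) -> None:
        # If a data point could identify a user, it never leaves the client.
        if not violates_privacy(datapoint):
            send(datapoint)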
Work on Cliqz Search started much earlier, around 2013, well before Vespa was open-sourced [1]. Our work on Kubernetes and modernizing our architecture started around 2016.
[1] https://www.verizonmedia.com/press/open-sourcing-vespa-yahoo...