Ghostery doesn't take money to unblock ads the way AdBlock Plus does. What gets blocked on a site depends on the community lists (e.g. EasyList, FanBoy). If you want to get unblocked in Ghostery, you need to convince the maintainers of those community lists.
That is nice to hear. When we had issues with the Human Web proxies - and that did not happen often - you were always quick to help out. Thanks for the great support over the years!
(Disclaimer: I work at Cliqz) We had problems with being blocked in the past.
In cases where we got a chance to explain, the maintainers agreed it was a false positive and took us off the block list. At least, that has happened so far in all cases that I'm aware of. However, there are so many lists that it is hard to keep track of them all. It would be nice if you could tell us which block list it is, so we can contact them.
If someone does not want to send Human Web data, the feature can also be disabled through the UI. The same applies if you browse in a private window: Human Web is automatically disabled there. There is no need to configure blocking rules.
(Disclaimer: I work at Cliqz) To expand on that, let me elaborate on why we cannot open up the data, not even a subset of it. We have had this discussion in the past, but for two reasons it is not an option.
Although it is anonymous data - currently we are not aware of any de-anonymization attacks - it is still data that came from real people. We have a responsibility: once the data is out, we have to guarantee that no one will ever be able to identify a single person in it. Also take into account that attackers can combine multiple data sets (Background Knowledge Attacks); that even includes data sets that will be published (or leaked) in the future.
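To illustrate what a Background Knowledge Attack looks like, here is a minimal, hypothetical sketch in Python. All names, values, and domains are invented; the point is only that joining an "anonymized" release with an outside data set on quasi-identifiers can re-identify people:

```python
# Hypothetical sketch of a Background Knowledge Attack.
# All names and values are invented for illustration.

# An "anonymized" release: direct identifiers removed, but
# quasi-identifiers (zip code, birth year) kept.
released = [
    {"zip": "10115", "birth_year": 1984, "visited": "health-forum.example"},
    {"zip": "80331", "birth_year": 1990, "visited": "news-site.example"},
]

# Background knowledge: a second, public or leaked data set.
background = [
    {"name": "Alice", "zip": "10115", "birth_year": 1984},
    {"name": "Bob",   "zip": "80331", "birth_year": 1971},
]

# Joining on the quasi-identifiers re-identifies every unique match.
reidentified = []
for record in released:
    matches = [p for p in background
               if p["zip"] == record["zip"]
               and p["birth_year"] == record["birth_year"]]
    if len(matches) == 1:  # a unique match exposes the person
        reidentified.append((matches[0]["name"], record["visited"]))

print(reidentified)  # Alice is linked to her browsing record
```

This is the kind of join the AOL and Netflix cases made famous, and it is why "no identifiers in the data" alone is not a sufficient guarantee.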
You should never be too confident when it comes to security, nor should you underestimate the creativity of attackers. What we can do - and have done in the past - is simulate the scenario in a controlled environment by hiring pen-testing companies. If they find an attack, they will not use that knowledge to harm the people behind the identities they could reveal.
That is the main reason. We don't want to end up in the situation AOL or Netflix found themselves in when they published their data. By the way, Netflix is an example of a Background Knowledge Attack, where the attackers needed to combine data sources.
There is also another argument. Skeptics will most likely remain skeptics, as we cannot prove that we did not filter the data before publishing it. In other words, there is nothing for us to gain; we can only lose. Trust is important, but to build trust it is better to be transparent about the data that gets sent on the client. You can verify that part yourself and do not have to rely on trust alone. That is the core idea behind our privacy-by-design approach.
Those are the arguments I'm aware of for why we will not open up the data. However, getting access in controlled environments is possible. If you are doing security/privacy research, you can reach out to us. In my opinion, having more people trying to find flaws in our heuristics is useful. It gives us a chance to fix them before they can be used for attacks.
One notable exception: https://whotracks.me is built from Human Web, and all its underlying data can be freely downloaded. We know that it has already been used for research.
My take on it: although we do see value in differential privacy, we do not believe it fits well in our particular case. The critical moment is deciding what data should be sent by the client. Once the data is out, it is out. It is not possible to apply anonymization after it is on the server. If someone knows how that can be done safely, I would be highly interested.
We consider our chosen approach - breaking record linkage before sending - safer for our use case and simpler. Do not underestimate the simplicity argument. Differential privacy is a powerful technique, but it is also very complex; there are lots of pitfalls, and it is crucial to make good choices for the parameters.
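To make the parameter point concrete, here is a small self-contained sketch of the classic Laplace mechanism (not our system; just a textbook example). It shows how directly the privacy parameter epsilon trades off against accuracy, which is exactly the kind of choice that is easy to get wrong:

```python
import math
import random

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise calibrated to (epsilon, sensitivity).
    Textbook Laplace mechanism; sampled via the inverse CDF."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5  # Uniform(-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Smaller epsilon means stronger privacy but noisier answers; the expected
# absolute error of the Laplace mechanism is sensitivity / epsilon.
random.seed(42)
errors = {}
for eps in (0.01, 0.1, 1.0):
    samples = [dp_count(1000, eps) for _ in range(2000)]
    errors[eps] = sum(abs(s - 1000) for s in samples) / len(samples)
    print(f"epsilon={eps}: mean absolute error ~ {errors[eps]:.1f}")
```

Pick epsilon too small and the answers are useless; pick it too large and the privacy guarantee is meaningless - and that is before accounting for budget composition across repeated queries.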
(Disclaimer: I work at Cliqz) Just read the article. It is from 2017 and very short. For the non-German speakers, let me translate the relevant part:
> Rund ein Prozent der Firefox-Downloads enthalten künftig das Add-On Cliqz, das bereits beim Eintippen Vorschläge für Webseiten anzeigt. Dafür wertet es die Surf-Aktivitäten aller Nutzer aus.
About 1% of Firefox downloads will contain the Cliqz add-on, which shows suggestions for websites while you type. For that, it analyzes the browsing activities of all users.
---
The last part, "of all users", is important. Yes, our search is built on data collected from users, but the point is that we cannot build profiles of single users; we only see what the whole group of users does. I cannot stress that enough. We are not Avast.
In fact, we are very open about our data collection system called Human Web:
I can understand that you did not like the way Mozilla rolled it out in 2017. I'm also not happy about how it went (my personal opinion). But on the technical side, I'm more than happy to take any questions on that topic (how we collect data at Cliqz).
(Disclaimer: I work at Cliqz) I don't work on the search itself, but recently did some work on the crawling part. What I know is that crawling is far more difficult if you are not a big player. Sites will quickly block you once you hit a rate limit.
We have to be very careful, because once we get blocked there is normally no way to get unblocked again. You can try sending them an email asking to be unblocked, but it is unlikely that you will get a response. This is one part of why crawling is slow. The other part is more obvious: the internet is large.
The blocking part is hard to overcome as a small player, while for Google it is the opposite: sites simply cannot afford to be excluded from the index. If we did not have to care about rate limits, the problem would be much simpler.
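The "be careful with rate limits" part boils down to per-domain politeness. Here is a minimal, hypothetical sketch (not our actual crawler) of the core bookkeeping: enforce a minimum delay between two requests to the same domain, with an injectable clock so it is easy to test:

```python
from urllib.parse import urlparse

class DomainRateLimiter:
    """Minimal sketch of per-domain crawl politeness: enforce a minimum
    delay between two requests to the same domain. A real crawler would
    also honor robots.txt, Crawl-delay, and back off on 429/503 responses."""

    def __init__(self, min_delay_seconds=5.0):
        self.min_delay = min_delay_seconds
        self.last_request = {}  # domain -> timestamp of the last fetch

    def wait_time(self, url, now):
        """Seconds to wait before `url` may be fetched at time `now`."""
        domain = urlparse(url).netloc
        last = self.last_request.get(domain)
        if last is None:
            return 0.0  # never seen this domain: go ahead
        return max(0.0, self.min_delay - (now - last))

    def record_fetch(self, url, now):
        """Remember when we last hit this domain."""
        self.last_request[urlparse(url).netloc] = now

limiter = DomainRateLimiter(min_delay_seconds=5.0)
limiter.record_fetch("https://example.com/page1", now=0.0)
print(limiter.wait_time("https://example.com/page2", now=2.0))  # same domain: wait
print(limiter.wait_time("https://other.example/", now=2.0))     # fresh domain: go
```

With millions of domains this bookkeeping is what caps throughput: the frontier is almost always full of URLs you are not yet allowed to fetch.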
(disclaimer: I work at Cliqz) Not yet, but I'm literally working on it (or rather taking a break from working on it).
Short answer: the proxy will see your IP but does not share that information with us. To prevent the proxy from reading your content, we need to end-to-end encrypt the communication (and prevent statistical attacks based on the size of the encrypted data, and so on).
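On the size-based statistical attacks: encryption hides the content but not the length of a message, so a common countermeasure is to pad messages to a small set of fixed bucket sizes before encrypting. Here is a hypothetical sketch of that idea (the bucket sizes and framing are invented, not our actual protocol):

```python
def pad_to_bucket(payload: bytes, buckets=(1024, 4096, 16384)) -> bytes:
    """Pad a message up to the next fixed bucket size, so an observer of
    the ciphertext only learns the size class, not the exact length.
    Hypothetical framing: 2-byte big-endian length prefix, zero padding."""
    for size in buckets:
        if len(payload) + 2 <= size:
            padding = size - len(payload) - 2
            return len(payload).to_bytes(2, "big") + payload + b"\x00" * padding
    raise ValueError("payload too large for any bucket")

def unpad(padded: bytes) -> bytes:
    """Recover the original payload from a padded message."""
    n = int.from_bytes(padded[:2], "big")
    return padded[2:2 + n]

msg = pad_to_bucket(b"some telemetry record")
print(len(msg))          # always a bucket size, regardless of content length
print(unpad(msg))        # round-trips back to the original payload
```

The padded message is what gets encrypted end to end; the proxy then sees only one of a few fixed ciphertext sizes.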
Regarding Tor:
Yes, we experimented with sending data through Tor. The main issue is that our code needs to run in a WebExtension, which is a restricted environment: you can only use WebSocket communication, not raw TCP sockets. The next blog post in the series has more information and links to the code of our experimental Tor client (the native Tor client compiled to WebAssembly so it can run in a WebExtension).
I hope the post will address your open questions. If not, you can ask tomorrow about the details of HPN.