Hacker News | maximeago's comments

Nice way to make differential privacy approachable to everyone with open source. Was it peer-reviewed?


Yes, Qrlew is based on a research paper (https://arxiv.org/pdf/2401.06273.pdf) presented at an AAAI 2024 workshop: https://ppai-workshop.github.io/

As you may know, Differential Privacy is hard to implement correctly. To foster trust, we relied on a two-pronged strategy:

- Open-source

- Peer-reviewed methodology

Feel free to reach out to us if you need more details.


Yes, Sarus is on https://www.linkedin.com/company/sarus-technologies, feel free to follow us or add the founders directly.


Privitar and Leapyear are indeed part of the competition on the more mature side of the spectrum. Even if all three of us use differential privacy, I would say that each company's core value prop is a bit different:

- Sarus: replaces the manual governance of data access by "no-access". Analysts or data scientists can manipulate data without accessing it. The absence of access means that the process is considerably simplified and no longer relies on many manual decisions and controls. Differential privacy is here as a way to automate protection.

- Privitar: it is a more traditional data governance solution. It is all about controls and manual decisions. In their own words, they feature an "unbeatable breadth of privacy techniques". Differential privacy is one of them. They leave it to the privacy professional to make their own implementation decisions, which is exactly what Sarus offers to disrupt.

- Leapyear: it is a data analysis solution powered by differential privacy. It does not seek to replace existing data governance processes. This is why they don't focus on blending into existing data workflows and only offer differential privacy as a way to access data, whereas Sarus can disappear into existing operations without requiring a learning curve on the part of analysts and data scientists.


Thanks a lot, very clear!


You're correct, Sarus never sees the data. The software runs directly on the data infrastructure of the client. It's typically deployed on the public cloud for instance.

And here, of course, differential privacy only guarantees data protection in the flow of data between the data source and the data practitioner. It should not be a replacement for other best practices like the ones you mention.


If the model training is designed to profile just one user, no, the model won't work by design. What you describe is an attack on the privacy of that user, and we do want to make sure such attacks fail.

The way differential privacy works with machine learning is that it guarantees that one given record cannot have a significant impact on the weights of the model and therefore on its performance. In the particular case of SGD-based models, the guarantee holds for every step of the descent. A good place to start on the topic is Abadi 2016 (https://arxiv.org/pdf/1607.00133.pdf).
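
For illustration only (this is not Sarus's code), here is a minimal sketch of one DP-SGD update in the style of Abadi 2016: each record's gradient is clipped to a fixed norm, then Gaussian noise scaled to that bound is added before averaging. The function names and the parameters `clip_norm` and `noise_multiplier` are assumptions made for this example.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale one record's gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One illustrative DP-SGD update (hypothetical helper, not a library API)."""
    # Clipping bounds how much any single record can move the weights.
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    # Gaussian noise is calibrated to the clipping bound, not to the data.
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    noisy_mean = [
        (sum(g[j] for g in clipped) + rng.gauss(0.0, sigma)) / n
        for j in range(len(weights))
    ]
    return [w - lr * m for w, m in zip(weights, noisy_mean)]
```

Because the clipping and the noise are applied at every step regardless of what the loss function computes, the per-step bound does not depend on the intent of the model author.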

What is important in the approach is that we don't need to detect that there is something funny in the loss function of the model. Sarus uses the exact same approach whether the model or the loss function is malevolent or not. The guarantees still hold. This is important because a lot of models can extract personal information even with no intention of doing so and no real way to detect it.

A good way to think about model performance is that we are looking for models that perform well irrespective of one record. If there are many users that have the same pattern as the user you are trying to spy on, the model may still be good but you won't know whether it's because of that user or not.


This person just joined last month! ;)


Thanks for catching it! Will fix it.


We developed our own generative model for synthetic data generation. It is an autoregressive model where each variable/attribute is derived from previously generated ones using Transformer networks (more details here: https://arxiv.org/pdf/2202.02145.pdf). So yes, correlations are modelled, although exact multicollinearity (when there is a linear relationship between a bunch of attributes) would be a bit blurry in the synthetic data.

This being said, the goal of Sarus is to enable analysis on the original data with a privacy guarantee on the result (synthetic data is merely used as a tool and a fallback when there is no better solution), so you can write a statistical test to detect multicollinearity and run it on the original data within Sarus.
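
As a rough example of such a test, here is a plain NumPy sketch that flags exact multicollinearity by looking for a near-zero singular value in the centered data matrix. The function name and tolerance are assumptions made for illustration, it assumes more rows than columns, and how such a check would be submitted through Sarus is not shown here.

```python
import numpy as np


def has_exact_multicollinearity(X, tol=1e-8):
    """Return True if some column is numerically a linear combination
    of the other columns plus a constant (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Centering removes the constant term, so affine relationships are caught too.
    centered = X - X.mean(axis=0)
    # A singular value close to zero reveals a linear dependency between columns.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return bool(singular_values.min() <= tol * singular_values.max())
```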


The product solves the problem of the time it takes to access sensitive data for analytics and machine learning. When you work in a large healthcare or financial organization, each dataset is highly protected. Each time a data practitioner needs to work on it, they may have to wait for months for compliance processes to opine on a data masking strategy and for engineering teams to prepare a data lab and implement this strategy. With Sarus, data practitioners no longer need to access data to do analytics or machine learning on sensitive data assets.

When internal access to personal data is not a concern within an organization, data sharing with external partners certainly is. This process can be avoided just the same.

Hence the promise of taking time-to-data from months to minutes.

Hope that helps clarify.


Thanks, this makes it a lot more clear!

Maybe the hero text could be more clear, explaining in summary what it does (similar to this comment). "Get instant access to sensitive data for analytics and machine learning."


That's a great suggestion actually! We'll definitely work on it and thanks for your help.


We developed our own generative model for synthetic data generation. It is an autoregressive model where each variable is derived from previously generated ones using Transformer networks. If you are interested, there are more details in the paper (https://arxiv.org/pdf/2202.02145.pdf). When we say it works on any type of data, we mean numerical, categorical, text, images, and compositions of those types (see the paper).

